Qwen 3.5 MoE - Speed King Explained
Qwen 3.5 MoE gives you massive-model brainpower at tiny-model speeds. Here is how to tune it for 100+ TPS on your local machine.
Why Qwen 3.5 MoE is the Speed King
If you're building local AI agents, running RAG pipelines, or anything that needs fast inference, the Qwen 3.5 MoE (Mixture of Experts) is basically cheating. The MoE version is in a completely different league than the 2B dense version — think of it like the difference between a sports car and a hypercar.
Here's the deal: Qwen 3.5 MoE gives you massive-model brain power at tiny-model speeds.
What the Heck is MoE Anyway
Mixture of Experts (MoE)[^1] sounds intimidating but is actually simple. Imagine a company with 14 departments, but only 2-3 of them show up to work on any given day. That's MoE.
The Qwen 3.5 MoE model has roughly 14 billion total parameters[^2], but only activates about 2-3 billion of them for any single token generation. The rest sit at home and collect dust.
The MoE version activates roughly the same number of parameters as the 2B dense model, but it has access to way more expertise baked into those inactive parameters.
The Router: Traffic Cop for Your Tokens
Every MoE model has a router — a small neural network that decides which experts handle which tokens. It's like having a smart dispatch system that sends each query to the right specialist.
The router looks at each token and decides: "This one needs math reasoning, send it to Expert #7. This one is creative writing, Expert #3."
This is why MoE models can be so smart — they have specialists for everything, but only pay the compute cost for the ones they actually use.
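As a toy illustration (not the model's actual learned router, which scores hidden states with a linear layer), top-2 routing over 14 experts boils down to: rank the experts by score, keep the best two, and softmax-normalize just their scores into mixing weights.

```python
import math
import random

def route(expert_scores, k=2):
    """Pick the top-k experts for one token and softmax-normalize their scores."""
    # Rank experts by routing score (one score per expert), keep the top k.
    top = sorted(range(len(expert_scores)),
                 key=lambda i: expert_scores[i], reverse=True)[:k]
    # Softmax over just the chosen experts gives their mixing weights.
    exps = [math.exp(expert_scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 14 experts, but only 2 are selected for this token.
random.seed(0)
scores = [random.gauss(0, 1) for _ in range(14)]
chosen = route(scores, k=2)
print(chosen)  # two (expert_index, weight) pairs, weights summing to 1
```

The token's output is then the weighted sum of the two chosen experts' outputs; the other 12 experts never run.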
Why MoE Matters
Model size does not linearly correlate with capability. A 14B MoE model can punch well above its weight because different experts specialize in different types of reasoning.
When you run the MoE model, the router plays traffic cop: for every token, it decides which experts should handle it.
The result: you get a model that knows way more than a 2B model should, but runs at speeds that would make a 2B model jealous on the right hardware.
MoE vs Dense: The Analogy
Think of it like a team of specialists versus a generalist:
| Approach | Analogy | Pros | Cons |
|---|---|---|---|
| Dense (2B) | One person who knows a bit of everything | Efficient, consistent | Limited expertise |
| MoE (14B/3B) | 14 specialists, only 2-3 show up daily | Way smarter, same speed | More complex |
Speed: TTFT vs TPS
There are two different speed metrics that matter when evaluating LLM performance.
Time to First Token (TTFT)[^3]
How long it takes for the model to spit out the first word after you hit enter. For MoE models, the router is stupid fast because it's a small, lightweight component.
TTFT matters most for:
- Interactive chat applications
- Real-time transcription
- Any scenario where the user is waiting for acknowledgment
Tokens Per Second (TPS)[^4]
Once the first token drops, how fast do the rest come out? This is the number that matters most for throughput.
With Qwen 3.5 MoE on an RTX 4070 or better, you will see 100-150 TPS. On an RTX 4090? We're talking 180+ TPS.
Why TPS matters:
- Low TPS under 30: you will feel the lag, conversations feel stilted
- Medium TPS 30-80: readable, good for casual chat with small responses
- High TPS 80-150: smooth, the sweet spot for building agents[^5]
- Insane TPS over 150: almost feels like streaming in real time
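If you want to compute these two numbers yourself, Ollama's streaming `/api/generate` endpoint reports stats in its final message; assuming the usual `eval_count` / `eval_duration` fields (durations in nanoseconds), the math falls out directly:

```python
def throughput(stats):
    """Derive TPS and an approximate TTFT (seconds) from Ollama-style
    final-message stats, where all durations are nanoseconds."""
    tps = stats["eval_count"] / (stats["eval_duration"] / 1e9)
    # TTFT is roughly model load time plus prompt prefill time.
    ttft = (stats.get("load_duration", 0)
            + stats.get("prompt_eval_duration", 0)) / 1e9
    return tps, ttft

# Hypothetical numbers: 450 tokens generated in 3.0 s, 0.2 s of prefill.
sample = {"eval_count": 450,
          "eval_duration": 3_000_000_000,
          "prompt_eval_duration": 200_000_000}
tps, ttft = throughput(sample)
print(f"{tps:.0f} TPS, TTFT ~ {ttft:.2f}s")  # 150 TPS, TTFT ~ 0.20s
```

The sample values here are made up for illustration; plug in the stats object your own run returns.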
Hardware Guide
RTX GPUs
Here's the breakdown for NVIDIA cards:
| GPU | VRAM | Expected TPS | Notes |
|---|---|---|---|
| RTX 3060 | 12GB | 40-60 | Decent starter option |
| RTX 4070 | 12GB | 80-110 | Sweet spot |
| RTX 4070 Ti Super | 16GB | 100-130 | Great upgrade |
| RTX 4090 | 24GB | 150-200 | Absolute monster |
| RTX 5090 | 32GB | 180-250 | Future proof 🚀 |
Apple Silicon
M-series Macs are great for this because they use unified memory[^6].
| Chip | Memory | TPS | Notes |
|---|---|---|---|
| M3 | 24GB | 50-80 | Solid |
| M3 Pro | 36GB | 60-90 | Great |
| M3 Max | 64GB | 80-120 | Overkill |
| M4 Max | 64GB+ | 90-130 | New king |
Linux / WSL2 Note
If you're running on Linux or WSL2, make sure you have the latest NVIDIA drivers. The compute capability of your GPU must support the model's requirements.
```bash
# Check your GPU
nvidia-smi

# Update drivers if needed
sudo apt update && sudo apt install nvidia-driver-535
```
Ollama Config
To really unleash Qwen 3.5 MoE, you need to tune it. The default settings are conservative — you can squeeze out way more performance.
Create a file called Modelfile:
```dockerfile
FROM qwen:3.5-2b-q4_K_M

PARAMETER num_ctx 32768
PARAMETER num_batch 512
PARAMETER num_gpu 99
PARAMETER flash_attn true
PARAMETER temperature 0.1
```
Then apply it:
```bash
ollama create qwen35-moe-fast -f Modelfile
ollama run qwen35-moe-fast
```
What Each Parameter Does
- num_ctx[^7]: controls how many tokens the model can see. More context means more memory. For most RAG workflows, 16k-32k is right.
- num_batch[^8]: how many tokens processed in parallel. Higher batch size means faster prefill but more memory.
- flash_attn[^9]: mathematically optimized attention. For long contexts, this is non-negotiable.
- num_gpu: set to 99 to use all available.
- temperature[^10]: controls randomness. Lower = more deterministic, higher = more creative. For RAG, keep it low (0.1-0.3).
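If you'd rather not bake everything into a Modelfile, Ollama also accepts these tuning knobs per request via the `options` field of its REST API. This sketch just builds the request payload (only the core context/sampling options are shown, and the model name assumes the `qwen35-moe-fast` alias created above):

```python
import json

# Same tuning as the Modelfile, but sent per request to Ollama's REST API.
payload = {
    "model": "qwen35-moe-fast",
    "prompt": "Summarize the attached context.",
    "stream": True,
    "options": {
        "num_ctx": 32768,     # context window
        "num_batch": 512,     # prefill batch size
        "num_gpu": 99,        # offload all layers to GPU
        "temperature": 0.1,   # near-deterministic, good for RAG
    },
}
body = json.dumps(payload)
# POST this body to http://localhost:11434/api/generate (server must be running).
print(body[:60])
```

Per-request options override whatever the Modelfile set, which is handy when one agent step needs a long context and another doesn't.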
Practical Use Cases
1. Iterative Agents
If you're building agents that loop[^11], a slow model means 10 think-and-respond cycles might take 2 minutes. The same cycles with Qwen 3.5 MoE take 10 seconds.
That's the difference between a tool that's usable and one that collects dust.
2. Large RAG Chunks
With Flash Attention and a properly tuned batch size, the prefill[^12] on a 10k-token document happens in milliseconds.
This is huge for RAG because:
- Users don't wait for context to load
- More context = better answers
- You can chunk larger documents
3. Real-Time Anything
High TPS makes everything feel native. Low TPS makes everything feel like waiting on an API call.
Think about it: at 150 TPS, a 500-token response takes 3.3 seconds. That feels like reading fast, not like waiting for an LLM.
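That arithmetic is worth keeping handy. A tiny helper (my own convenience function, not from any library) makes the latency tradeoff explicit:

```python
def response_seconds(tokens, tps, ttft=0.0):
    """Wall-clock time to stream a response: first-token latency plus decode time."""
    return ttft + tokens / tps

# A 500-token response at different speeds:
print(round(response_seconds(500, 150), 1))  # 3.3 seconds -- feels instant
print(round(response_seconds(500, 30), 1))   # 16.7 seconds -- feels like waiting
```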
Model Comparison
Here's how Qwen 3.5 MoE stacks up against the competition:
| Model | Size | Active Params | TPS (RTX 4090) | Notes |
|---|---|---|---|---|
| Qwen 2.5 2B | 2B | 2B | 50-60 | Great baseline |
| Qwen 3.5 MoE | 14B | 2-3B | 150-200 | Speed king |
| Llama 3.2 3B | 3B | 3B | 60-80 | Solid but slower |
| Mistral 7B | 7B | 7B | 40-50 | Classic dense |
| Phi-4 4B | 4B | 4B | 70-90 | Microsoft's offering |
Advanced Tuning Tips
KV Cache Optimization
The KV cache stores precomputed attention keys and values. More cache = faster generation:
```dockerfile
PARAMETER num_ctx 32768
PARAMETER num_keep 32768
```
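To see why a large `num_ctx` costs memory, here's a back-of-the-envelope KV-cache estimator. The layer and head dimensions below are placeholders for illustration only, not Qwen's published config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_val=2):
    """Approximate KV-cache size: keys + values for every layer at full context.
    bytes_per_val=2 assumes fp16 cache entries."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_val

# Hypothetical dimensions, illustrative only:
gib = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx_len=32768) / 2**30
print(f"~{gib:.2f} GiB")  # ~1.75 GiB at fp16
```

Halving the context roughly halves this number, which is why dropping from 32k to 16k is the first lever to pull when VRAM is tight.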
Batch Size Tuning
If you're running out of memory, lower the batch size:
```dockerfile
# Half of default
PARAMETER num_batch 256
```
If you have VRAM to spare, bump it up:
```dockerfile
# Crusher mode
PARAMETER num_batch 1024
```
Temperature Guidelines
| Use Case | Temperature | Notes |
|---|---|---|
| Coding | 0.0-0.1 | Deterministic, correct |
| RAG Q&A | 0.1-0.2 | Factual, low hallucination |
| Creative writing | 0.7-0.9 | More varied output |
| Humor | 0.8-1.0 | Maximum chaos |
TL;DR
- 🔄 MoE[^1] means different experts, same compute — 14B total params, 2-3B active
- 🚀 TPS[^4] matters more than model size for real-world use
- ⚡ Flash Attention[^9] is mandatory for long contexts — it's 5-10x faster
- 🎮 RTX 4070+ is the sweet spot for local AI
Wrapping Up
Qwen 3.5 MoE is not perfect. The 2B dense model will always win on pure VRAM efficiency. But if you have the hardware to run it, the MoE version is faster and smarter.
The key is making sure your Ollama config is not the bottleneck. The default settings are there for compatibility — not performance.
❓ Frequently Asked Questions
Q: Why not just use the 2B dense model?
The 2B model is great for VRAM-constrained environments — it runs on 2GB of VRAM and fits in almost any GPU. However, if you have an RTX 4070 or better, the MoE version is strictly better: it has more knowledge (14B params worth) at roughly the same speed and only needs about 4-5GB of VRAM.
Think of it this way: you paid for the 4070, you might as well use it.
Q: Does MoE work on CPU?
Technically yes, but it's not worth it. The routing overhead actually makes MoE slower than dense on CPU. If you're running on CPU only, stick with the 2B or 3B dense models.
The MoE advantage only kicks in with GPU acceleration where the parallel computation pays off.
Q: How much VRAM do I actually need?
Here's the breakdown:
| Model | Quantization | VRAM Needed |
|---|---|---|
| Qwen 2B | Q4_K_M | ~2.5GB |
| Qwen 3.5 MoE | Q4_K_M | ~4-5GB |
| Qwen 3.5 MoE | Q5_K_M | ~6-7GB |
| Qwen 3.5 MoE | F16 | ~14GB |
For most people, Q4_K_M is the sweet spot — almost indistinguishable quality, way less VRAM.
Q: Can I run this on a laptop?
Only if you have a gaming laptop with an RTX 4060 or better. Integrated graphics won't cut it — you need dedicated VRAM.
MacBooks with M3/M4 chips work great thanks to unified memory, but you'll need to allocate at least 18GB of RAM to the model.
Q: What's the difference between Qwen 3.5 MoE and Qwen 2.5?
Qwen 2.5 is the base model family, Qwen 3.5 is the latest generation with improved reasoning. The "MoE" variant uses the Mixture of Experts architecture to get more capability out of the same compute.
In short: Qwen 3.5 MoE = Qwen 3.5 base + expert routing.
Q: Is Ollama stable for production?
Ollama is great for development and local testing. For production deployments, consider using llama.cpp server mode or a proper inference server like vLLM. Ollama is basically a wrapper around llama.cpp with a convenient CLI.
For your local AI agent development, Ollama is perfect.
Q: How do I benchmark this properly?
Use the ollama benchmark command or measure with time in your code. Key metrics:
- TTFT: use `time ollama run qwen35-moe-fast "prompt"` and measure to first token
- TPS: count tokens in output divided by generation time
- Actual VRAM: check `nvidia-smi` during generation
Run at least 3 iterations and average — cold start times vary.
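A minimal harness for the "average over warm runs" advice might look like this; the `run` callable would wrap your actual Ollama call, and a sleep stands in for it here:

```python
import statistics
import time

def bench(run, iters=4, warmup=1):
    """Time run() over several iterations, discarding warm-up (cold-start) runs,
    and return the mean wall-clock time of the remaining runs in seconds."""
    times = []
    for i in range(iters):
        t0 = time.perf_counter()
        run()
        dt = time.perf_counter() - t0
        if i >= warmup:  # skip cold starts
            times.append(dt)
    return statistics.mean(times)

# Stand-in workload; replace with a function that calls your model.
avg = bench(lambda: time.sleep(0.01), iters=4, warmup=1)
print(f"avg {avg * 1000:.1f} ms over warm runs")
```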
Q: What about other MoE models?
Qwen 3.5 MoE is currently the best for local inference. Other options:
- Mixtral 8x7B: Good but needs ~48GB VRAM (not local-friendly)
- DeepSeek V3: Competing with Qwen, newer
- Gemma 2 MoE: Google's offering, still maturing
For local AI on consumer hardware, Qwen 3.5 MoE is the king.
Footnotes
[^1]: MoE (Mixture of Experts) — A neural network architecture where multiple specialized "expert" networks exist, but only a subset is active for any given input. It's like having 14 specialists in a company, but only the relevant 2-3 show up for each task.

[^2]: Parameters — The weights and biases in a neural network. Think of them as the "knowledge" the model has learned. More parameters generally means more knowledge, but also more compute cost.

[^3]: TTFT (Time to First Token) — The latency between sending a prompt and receiving the first generated token. Critical for interactive applications where users expect immediate feedback.

[^4]: TPS (Tokens Per Second) — The generation throughput after the first token. This determines how fast the model can produce long outputs. Also called "streaming throughput."

[^5]: Agents — LLM-powered programs that can take multiple actions in sequence, like think → search → synthesize → respond. High TPS is critical because agents make many sequential calls.

[^6]: Unified Memory — Apple's architecture where CPU and GPU share the same RAM. This is more efficient than discrete GPUs because data doesn't need to be copied between CPU and GPU memory.

[^7]: num_ctx (Context Length) — How many tokens the model can "see" at once. For RAG, you typically want 16k-32k to load entire documents.

[^8]: num_batch (Batch Size) — How many tokens are processed simultaneously. Higher = faster prefill but more VRAM. Balance this with your GPU's memory.

[^9]: Flash Attention — A mathematically equivalent but faster attention mechanism. Uses less memory and computes faster by not materializing the full attention matrix. Essential for long contexts.

[^10]: Temperature — A sampling parameter that controls randomness. 0 = always pick the most likely next token (deterministic). Higher = more random/creative.

[^11]: Iterative Agents — Agents that loop multiple times, thinking through a problem step by step. Tools like Claude Code and OpenAI's o1 use this pattern heavily.

[^12]: Prefill — The phase where the model processes your input before generating output. This is where Flash Attention shines — long inputs load almost instantly.