Qwen 3.5 MoE - Speed King Explained
Qwen 3.5 MoE gives you massive-model brainpower at tiny-model speeds. Here is how to tune it for 100+ TPS on your local machine.
Why Qwen 3.5 MoE is the Speed King
If you're building local AI agents, running RAG pipelines, or anything that needs fast inference, the Qwen 3.5 MoE (Mixture of Experts) is basically cheating. The MoE version is in a completely different league than the 2B dense version — think of it like the difference between a sports car and a hypercar.
Here's the deal: Qwen 3.5 MoE gives you massive-model brain power at tiny-model speeds.
What the Heck is MoE Anyway
Mixture of Experts (MoE)[^1] sounds intimidating but is actually simple. Imagine a company with 14 departments, but only 2-3 of them show up to work on any given day. That's MoE.
The Qwen 3.5 MoE model has roughly 14 billion total parameters[^2], but only activates about 2-3 billion of them for any single token generation. The rest sit at home and collect dust.
The MoE version activates roughly the same number of parameters as the 2B dense model, but it has access to way more expertise baked into those inactive parameters.
The Router: Traffic Cop for Your Tokens
Every MoE model has a router — a small neural network that decides which experts handle which tokens. It's like having a smart dispatch system that sends each query to the right specialist.
The router looks at each token and decides: "This one needs math reasoning, send it to Expert #7. This one is creative writing, Expert #3."
This is why MoE models can be so smart — they have specialists for everything, but only pay the compute cost for the ones they actually use.
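As a toy illustration (not the model's actual learned router, which scores hidden states with a linear layer), top-2 routing over 14 experts boils down to: rank the experts by score, keep the best two, and softmax-normalize just their scores into mixing weights.

```python
import math
import random

def route(expert_scores, k=2):
    """Pick the top-k experts for one token and softmax-normalize their scores."""
    # Rank experts by routing score (one score per expert), keep the top k.
    top = sorted(range(len(expert_scores)),
                 key=lambda i: expert_scores[i], reverse=True)[:k]
    # Softmax over just the chosen experts gives their mixing weights.
    exps = [math.exp(expert_scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 14 experts, but only 2 are selected for this token.
random.seed(0)
scores = [random.gauss(0, 1) for _ in range(14)]
chosen = route(scores, k=2)
print(chosen)  # two (expert_index, weight) pairs, weights summing to 1
```

The token's output is then the weighted sum of the two chosen experts' outputs; the other 12 experts never run.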
Why MoE Matters
Model size does not linearly correlate with capability. A 14B MoE model can punch well above its weight because different experts specialize in different types of reasoning.
When you run the MoE model, the router plays traffic cop: for every token, it decides which experts should handle it.
The result: you get a model that knows way more than a 2B model should, but runs at speeds that would make a 2B model jealous on the right hardware.
MoE vs Dense: The Analogy
Think of it like a team of specialists versus a generalist:
| Approach | Analogy | Pros | Cons |
|---|---|---|---|
| Dense (2B) | One person who knows a bit of everything | Efficient, consistent | Limited expertise |
| MoE (14B/3B) | 14 specialists, only 2-3 show up daily | Way smarter, same speed | More complex |
Speed: TTFT vs TPS
There are two different speed metrics that matter when evaluating LLM performance.
Time to First Token (TTFT)[^3]
How long it takes for the model to spit out the first word after you hit enter. For MoE models, the router is stupid fast because it's a small, lightweight component.
TTFT matters most for:
- Interactive chat applications
- Real-time transcription
- Any scenario where the user is waiting for acknowledgment
Tokens Per Second (TPS)[^4]
Once the first token drops, how fast do the rest come out? This is the number that matters most for throughput.
With Qwen 3.5 MoE on an RTX 4070 or better, you will see 100-150 TPS. On an RTX 4090? We're talking 180+ TPS.
Why TPS matters:
- Low TPS under 30: you will feel the lag, conversations feel stilted
- Medium TPS 30-80: readable, good for casual chat with small responses
- High TPS 80-150: smooth, the sweet spot for building agents[^5]
- Insane TPS over 150: almost feels like streaming in real time
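If you want to compute these two numbers yourself, Ollama's streaming `/api/generate` endpoint reports stats in its final message; assuming the usual `eval_count` / `eval_duration` fields (durations in nanoseconds), the math falls out directly:

```python
def throughput(stats):
    """Derive TPS and an approximate TTFT (seconds) from Ollama-style
    final-message stats, where all durations are nanoseconds."""
    tps = stats["eval_count"] / (stats["eval_duration"] / 1e9)
    # TTFT is roughly model load time plus prompt prefill time.
    ttft = (stats.get("load_duration", 0)
            + stats.get("prompt_eval_duration", 0)) / 1e9
    return tps, ttft

# Hypothetical numbers: 450 tokens generated in 3.0 s, 0.2 s of prefill.
sample = {"eval_count": 450,
          "eval_duration": 3_000_000_000,
          "prompt_eval_duration": 200_000_000}
tps, ttft = throughput(sample)
print(f"{tps:.0f} TPS, TTFT ~ {ttft:.2f}s")  # 150 TPS, TTFT ~ 0.20s
```

The sample values here are made up for illustration; plug in the stats object your own run returns.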
Hardware Guide
RTX GPUs
Here's the breakdown for NVIDIA cards:
| GPU | VRAM | Expected TPS | Notes |
|---|---|---|---|
| RTX 3060 | 12GB | 40-60 | Decent starter option |
| RTX 4070 | 12GB | 80-110 | Sweet spot |
| RTX 4070 Ti Super | 16GB | 100-130 | Great upgrade |
| RTX 4090 | 24GB | 150-200 | Absolute monster |
| RTX 5090 | 32GB | 180-250 | Future proof 🚀 |
Apple Silicon
M-series Macs are great for this because they use unified memory[^6].
| Chip | Memory | TPS | Notes |
|---|---|---|---|
| M3 | 24GB | 50-80 | Solid |
| M3 Pro | 36GB | 60-90 | Great |
| M3 Max | 64GB | 80-120 | Overkill |
| M4 Max | 64GB+ | 90-130 | New king |
Linux / WSL2 Note
If you're running on Linux or WSL2, make sure you have the latest NVIDIA drivers. The compute capability of your GPU must support the model's requirements.
```bash
# Check your GPU
nvidia-smi

# Update drivers if needed
sudo apt update && sudo apt install nvidia-driver-535
```
Ollama Config
To really unleash Qwen 3.5 MoE, you need to tune it. The default settings are conservative — you can squeeze out way more performance.
Create a file called Modelfile:
```dockerfile
FROM qwen:3.5-2b-q4_K_M

PARAMETER num_ctx 32768
PARAMETER num_batch 512
PARAMETER num_gpu 99
PARAMETER flash_attn true
PARAMETER temperature 0.1
```
Then apply it:
```bash
ollama create qwen35-moe-fast -f Modelfile
ollama run qwen35-moe-fast
```
What Each Parameter Does
- num_ctx[^7]: controls how many tokens the model can see. More context means more memory. For most RAG workflows, 16k-32k is right.
- num_batch[^8]: how many tokens processed in parallel. Higher batch size means faster prefill but more memory.
- flash_attn[^9]: mathematically optimized attention. For long contexts, this is non-negotiable.
- num_gpu: set to 99 to use all available.
- temperature[^10]: controls randomness. Lower = more deterministic, higher = more creative. For RAG, keep it low (0.1-0.3).
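If you'd rather not bake everything into a Modelfile, Ollama also accepts these tuning knobs per request via the `options` field of its REST API. This sketch just builds the request payload (only the core context/sampling options are shown, and the model name assumes the `qwen35-moe-fast` alias created above):

```python
import json

# Same tuning as the Modelfile, but sent per request to Ollama's REST API.
payload = {
    "model": "qwen35-moe-fast",
    "prompt": "Summarize the attached context.",
    "stream": True,
    "options": {
        "num_ctx": 32768,     # context window
        "num_batch": 512,     # prefill batch size
        "num_gpu": 99,        # offload all layers to GPU
        "temperature": 0.1,   # near-deterministic, good for RAG
    },
}
body = json.dumps(payload)
# POST this body to http://localhost:11434/api/generate (server must be running).
print(body[:60])
```

Per-request options override whatever the Modelfile set, which is handy when one agent step needs a long context and another doesn't.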
Practical Use Cases
1. Iterative Agents
If you're building agents that loop[^11], a slow model means 10 think-and-respond cycles might take 2 minutes. The same cycles with Qwen 3.5 MoE take 10 seconds.
That's the difference between a tool that's usable and one that collects dust.
2. Large RAG Chunks
With Flash Attention and a properly tuned batch size, the prefill[^12] on a 10k-token document happens in milliseconds.
This is huge for RAG because:
- Users don't wait for context to load
- More context = better answers
- You can chunk larger documents
3. Real-Time Anything
High TPS makes everything feel native. Low TPS makes everything feel like waiting on an API call.
Think about it: at 150 TPS, a 500-token response takes 3.3 seconds. That feels like reading fast, not like waiting for an LLM.
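That arithmetic is worth keeping handy. A tiny helper (my own convenience function, not from any library) makes the latency tradeoff explicit:

```python
def response_seconds(tokens, tps, ttft=0.0):
    """Wall-clock time to stream a response: first-token latency plus decode time."""
    return ttft + tokens / tps

# A 500-token response at different speeds:
print(round(response_seconds(500, 150), 1))  # 3.3 seconds -- feels instant
print(round(response_seconds(500, 30), 1))   # 16.7 seconds -- feels like waiting
```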
Model Comparison
Here's how Qwen 3.5 MoE stacks up against the competition:
| Model | Size | Active Params | TPS (RTX 4090) | Notes |
|---|---|---|---|---|
| Qwen 2.5 2B | 2B | 2B | 50-60 | Great baseline |
| Qwen 3.5 MoE | 14B | 2-3B | 150-200 | Speed king |
| Llama 3.2 3B | 3B | 3B | 60-80 | Solid but slower |
| Mistral 7B | 7B | 7B | 40-50 | Classic dense |
| Phi-4 4B | 4B | 4B | 70-90 | Microsoft's offering |
Advanced Tuning Tips
KV Cache Optimization
The KV cache stores precomputed attention keys and values. More cache = faster generation:
```dockerfile
PARAMETER num_ctx 32768
PARAMETER num_keep 32768
```
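To see why a large `num_ctx` costs memory, here's a back-of-the-envelope KV-cache estimator. The layer and head dimensions below are placeholders for illustration only, not Qwen's published config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_val=2):
    """Approximate KV-cache size: keys + values for every layer at full context.
    bytes_per_val=2 assumes fp16 cache entries."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_val

# Hypothetical dimensions, illustrative only:
gib = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx_len=32768) / 2**30
print(f"~{gib:.2f} GiB")  # ~1.75 GiB at fp16
```

Halving the context roughly halves this number, which is why dropping from 32k to 16k is the first lever to pull when VRAM is tight.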
Batch Size Tuning
If you're running out of memory, lower the batch size:
```dockerfile
# Half of default
PARAMETER num_batch 256
```
If you have VRAM to spare, bump it up:
```dockerfile
# Crusher mode
PARAMETER num_batch 1024
```
Temperature Guidelines
| Use Case | Temperature | Notes |
|---|---|---|
| Coding | 0.0-0.1 | Deterministic, correct |
| RAG Q&A | 0.1-0.2 | Factual, low hallucination |
| Creative writing | 0.7-0.9 | More varied output |
| Humor | 0.8-1.0 | Maximum chaos |
TL;DR
- 🔄 MoE[^1] means different experts, same compute — 14B total params, 2-3B active
- 🚀 TPS[^4] matters more than model size for real-world use
- ⚡ Flash Attention[^9] is mandatory for long contexts — it's 5-10x faster
- 🎮 RTX 4070+ is the sweet spot for local AI
Wrapping Up
Qwen 3.5 MoE is not perfect. The 2B dense model will always win on pure VRAM efficiency. But if you have the hardware to run it, the MoE version is faster and smarter.
The key is making sure your Ollama config is not the bottleneck. The default settings are there for compatibility — not performance.
❓ Frequently Asked Questions
Q: Why not just use the 2B dense model?
The 2B model is great for VRAM-constrained environments — it runs on 2GB of VRAM and fits in almost any GPU. However, if you have an RTX 4070 or better, the MoE version is strictly better: it has more knowledge (14B params worth) at roughly the same speed and only needs about 4-5GB of VRAM.
Think of it this way: you paid for the 4070, you might as well use it.
Q: Does MoE work on CPU?
Technically yes, but it's not worth it. The routing overhead actually makes MoE slower than dense on CPU. If you're running on CPU only, stick with the 2B or 3B dense models.
The MoE advantage only kicks in with GPU acceleration where the parallel computation pays off.
Q: How much VRAM do I actually need?
Here's the breakdown:
| Model | Quantization | VRAM Needed |
|---|---|---|
| Qwen 2B | Q4_K_M | ~2.5GB |
| Qwen 3.5 MoE | Q4_K_M | ~4-5GB |
| Qwen 3.5 MoE | Q5_K_M | ~6-7GB |
| Qwen 3.5 MoE | F16 | ~14GB |
For most people, Q4_K_M is the sweet spot — almost indistinguishable quality, way less VRAM.
Q: Can I run this on a laptop?
Only if you have a gaming laptop with an RTX 4060 or better. Integrated graphics won't cut it — you need dedicated VRAM.
MacBooks with M3/M4 chips work great thanks to unified memory, but you'll need to allocate at least 18GB of RAM to the model.
Q: What's the difference between Qwen 3.5 MoE and Qwen 2.5?
Qwen 2.5 is the base model family, Qwen 3.5 is the latest generation with improved reasoning. The "MoE" variant uses the Mixture of Experts architecture to get more capability out of the same compute.
In short: Qwen 3.5 MoE = Qwen 3.5 base + expert routing.
Q: Is Ollama stable for production?
Ollama is great for development and local testing. For production deployments, consider using llama.cpp server mode or a proper inference server like vLLM. Ollama is basically a wrapper around llama.cpp with a convenient CLI.
For your local AI agent development, Ollama is perfect.
Q: How do I benchmark this properly?
Use the ollama benchmark command or measure with time in your code. Key metrics:
- TTFT: use `time ollama run qwen35-moe-fast "prompt"` and measure to first token
- TPS: count tokens in output divided by generation time
- Actual VRAM: check `nvidia-smi` during generation
Run at least 3 iterations and average — cold start times vary.
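A minimal harness for the "average over warm runs" advice might look like this; the `run` callable would wrap your actual Ollama call, and a sleep stands in for it here:

```python
import statistics
import time

def bench(run, iters=4, warmup=1):
    """Time run() over several iterations, discarding warm-up (cold-start) runs,
    and return the mean wall-clock time of the remaining runs in seconds."""
    times = []
    for i in range(iters):
        t0 = time.perf_counter()
        run()
        dt = time.perf_counter() - t0
        if i >= warmup:  # skip cold starts
            times.append(dt)
    return statistics.mean(times)

# Stand-in workload; replace with a function that calls your model.
avg = bench(lambda: time.sleep(0.01), iters=4, warmup=1)
print(f"avg {avg * 1000:.1f} ms over warm runs")
```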
Q: What about other MoE models?
Qwen 3.5 MoE is currently the best for local inference. Other options:
- Mixtral 8x7B: Good but needs ~48GB VRAM (not local-friendly)
- DeepSeek V3: Competing with Qwen, newer
- Gemma 2 MoE: Google's offering, still maturing
For local AI on consumer hardware, Qwen 3.5 MoE is the king.
Footnotes
[^1]: MoE (Mixture of Experts) — A neural network architecture where multiple specialized "expert" networks exist, but only a subset is active for any given input. It's like having 14 specialists in a company, but only the relevant 2-3 show up for each task.

[^2]: Parameters — The weights and biases in a neural network. Think of them as the "knowledge" the model has learned. More parameters generally means more knowledge, but also more compute cost.

[^3]: TTFT (Time to First Token) — The latency between sending a prompt and receiving the first generated token. Critical for interactive applications where users expect immediate feedback.

[^4]: TPS (Tokens Per Second) — The generation throughput after the first token. This determines how fast the model can produce long outputs. Also called "streaming throughput."

[^5]: Agents — LLM-powered programs that can take multiple actions in sequence, like think → search → synthesize → respond. High TPS is critical because agents make many sequential calls.

[^6]: Unified Memory — Apple's architecture where CPU and GPU share the same RAM. This is more efficient than discrete GPUs because data doesn't need to be copied between CPU and GPU memory.

[^7]: num_ctx (Context Length) — How many tokens the model can "see" at once. For RAG, you typically want 16k-32k to load entire documents.

[^8]: num_batch (Batch Size) — How many tokens are processed simultaneously. Higher = faster prefill but more VRAM. Balance this with your GPU's memory.

[^9]: Flash Attention — A mathematically equivalent but faster attention mechanism. Uses less memory and computes faster by not materializing the full attention matrix. Essential for long contexts.

[^10]: Temperature — A sampling parameter that controls randomness. 0 = always pick the most likely next token (deterministic). Higher = more random/creative.

[^11]: Iterative Agents — Agents that loop multiple times, thinking through a problem step by step. Tools like Claude Code and OpenAI's o1 use this pattern heavily.

[^12]: Prefill — The phase where the model processes your input before generating output. This is where Flash Attention shines — long inputs load almost instantly.