
Qwen 3.5 MoE - Speed King Explained


April 19, 2026
Paál Gyula
Founder & Lead Architect

Qwen 3.5 MoE gives you massive-model brainpower at tiny-model speeds. Here is how to tune it for 100+ TPS on your local machine.


Why Qwen 3.5 MoE is the Speed King

If you're building local AI agents, running RAG pipelines, or anything that needs fast inference, Qwen 3.5 MoE (Mixture of Experts) is basically cheating. The MoE version is in a completely different league than the 2B dense version — think of it like the difference between a sports car and a hypercar.

Here's the deal: Qwen 3.5 MoE gives you massive-model brain power at tiny-model speeds.

What the Heck is MoE Anyway

Mixture of Experts (MoE)[1] sounds intimidating but is actually simple. Imagine a company with 14 departments, but only 2-3 of them show up to work on any given day. That's MoE.

The Qwen 3.5 MoE model has roughly 14 billion total parameters[2], but only activates about 2-3 billion of them for any single token generation. The rest sit at home and collect dust.

The MoE version activates roughly the same number of parameters as the 2B dense model, but it has access to way more expertise baked into those inactive parameters.

The Router: Traffic Cop for Your Tokens

Every MoE model has a router — a small neural network that decides which experts handle which tokens. It's like having a smart dispatch system that sends each query to the right specialist.

The router looks at each token and decides: "This one needs math reasoning, send it to Expert #7. This one is creative writing, Expert #3."

This is why MoE models can be so smart — they have specialists for everything, but only pay the compute cost for the ones they actually use.
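The routing step itself is tiny: score every expert, keep the top few, renormalize their weights. Here is a minimal sketch of top-2 routing in Python — the expert count, the raw scores, and the top-2 choice are illustrative assumptions, not Qwen's actual gating code:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_scores, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    probs = softmax(token_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = ranked[:k]
    chosen_mass = sum(probs[i] for i in top)
    return [(i, probs[i] / chosen_mass) for i in top]

# One token's router scores across 14 hypothetical experts.
scores = [0.1, 2.0, -1.0, 0.5, 3.1, 0.0, -0.5, 1.2, 0.3, -2.0, 0.7, 0.2, -0.1, 0.4]
print(route(scores))  # the two highest-scoring experts carry this token
```

Only the chosen experts' feed-forward blocks run; the other 12 cost nothing for this token, which is where the speed comes from.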

Why MoE Matters

Model size does not linearly correlate with capability. A 14B MoE model can punch well above its weight because different experts specialize in different types of reasoning.

When you run the MoE model, the router plays traffic cop: for every token, it decides which experts should handle it.

The result: you get a model that knows way more than a 2B model should, but runs at speeds that would make a 2B model jealous on the right hardware.

MoE vs Dense: The Analogy

Think of it like a team of specialists versus a generalist:

| Approach | Analogy | Pros | Cons |
|---|---|---|---|
| Dense (2B) | One person who knows a bit of everything | Efficient, consistent | Limited expertise |
| MoE (14B/3B) | 14 specialists, only 2-3 show up daily | Way smarter, same speed | More complex |

Speed: TTFT vs TPS

There are two different speed metrics that matter when evaluating LLM performance.

Time to First Token (TTFT)[3]

How long it takes for the model to spit out the first word after you hit enter. For MoE models, the router is stupid fast because it's a small, lightweight component.

TTFT matters most for:

  • Interactive chat applications
  • Real-time transcription
  • Any scenario where a user is waiting for acknowledgment

Tokens Per Second (TPS)[4]

Once the first token drops, how fast do the rest come out? This is the number that matters most for throughput.

With Qwen 3.5 MoE on an RTX 4070 or better, you will see 100-150 TPS. On an RTX 4090? We're talking 180+ TPS.

Why TPS matters:

  • Low (under 30 TPS): you will feel the lag; conversations feel stilted
  • Medium (30-80 TPS): readable, good for casual chat with short responses
  • High (80-150 TPS): smooth, the sweet spot for building agents[5]
  • Insane (over 150 TPS): almost feels like streaming in real time
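For quick reference, the tiers above as a function — the thresholds are copied straight from the list, the function itself is just a convenience sketch:

```python
def tps_tier(tps: float) -> str:
    """Map a measured tokens-per-second value to the tiers described above."""
    if tps < 30:
        return "low"      # you will feel the lag
    if tps < 80:
        return "medium"   # fine for casual chat
    if tps <= 150:
        return "high"     # the sweet spot for agents
    return "insane"       # near real-time streaming

print(tps_tier(120))  # high
```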

Hardware Guide

RTX GPUs

Here's the breakdown for NVIDIA cards:

| GPU | VRAM | Expected TPS | Notes |
|---|---|---|---|
| RTX 3060 | 12GB | 40-60 | Decent starter option |
| RTX 4070 | 12GB | 80-110 | Sweet spot |
| RTX 4070 Ti Super | 16GB | 100-130 | Great upgrade |
| RTX 4090 | 24GB | 150-200 | Absolute monster |
| RTX 5090 | 32GB | 180-250 | Future proof 🚀 |

Apple Silicon

M-series Macs are great for this because they use unified memory[6].

| Chip | Memory | TPS | Notes |
|---|---|---|---|
| M3 | 24GB | 50-80 | Solid |
| M3 Pro | 36GB | 60-90 | Great |
| M3 Max | 64GB | 80-120 | Overkill |
| M4 Max | 64GB+ | 90-130 | New king |

Linux / WSL2 Note

If you're running on Linux or WSL2, make sure you have the latest NVIDIA drivers. The compute capability of your GPU must support the model's requirements.

```bash
# Check your GPU
nvidia-smi

# Update drivers if needed
sudo apt update && sudo apt install nvidia-driver-535
```

Ollama Config

To really unleash Qwen 3.5 MoE, you need to tune it. The default settings are conservative — you can squeeze out way more performance.

Create a file called Modelfile:

```dockerfile
FROM qwen:3.5-2b-q4_K_M

PARAMETER num_ctx 32768
PARAMETER num_batch 512
PARAMETER num_gpu 99
PARAMETER flash_attn true
PARAMETER temperature 0.1
```

Then apply it:

```bash
ollama create qwen35-moe-fast -f Modelfile
ollama run qwen35-moe-fast
```

What Each Parameter Does

  • num_ctx[7]: controls how many tokens the model can see at once. More context means more memory. For most RAG workflows, 16k-32k is right.
  • num_batch[8]: how many tokens are processed in parallel. A higher batch size means faster prefill but more memory.
  • flash_attn[9]: mathematically equivalent but faster attention. For long contexts, this is non-negotiable.
  • num_gpu: set to 99 to offload all layers to the GPU.
  • temperature[10]: controls randomness. Lower = more deterministic, higher = more creative. For RAG, keep it low (0.1-0.3).
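If you'd rather tune per request than bake everything into a Modelfile, the numeric parameters can also be passed in the `options` field of Ollama's REST API. A sketch — the model name `qwen35-moe-fast` assumes the Modelfile above was already applied, and only the numeric request-level options are shown:

```python
import json

def build_request(prompt: str, model: str = "qwen35-moe-fast") -> dict:
    """Assemble an Ollama /api/generate payload mirroring the Modelfile tuning."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,  # stream tokens as they are generated
        "options": {
            "num_ctx": 32768,
            "num_batch": 512,
            "num_gpu": 99,
            "temperature": 0.1,
        },
    }

payload = build_request("Summarize this document.")
print(json.dumps(payload, indent=2))
```

POST this to `http://localhost:11434/api/generate` with any HTTP client and you get the same behavior as the tuned model, one request at a time.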

Practical Use Cases

1. Iterative Agents

If you're building agents that loop[11], a slow model means 10 think-and-respond cycles might take 2 minutes. The same cycles with Qwen 3.5 MoE take 10 seconds.

That's the difference between a tool that's usable and one that collects dust.
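A quick back-of-the-envelope check of that claim — the cycle count and tokens per cycle here are made-up illustrative numbers, and TTFT and tool-call latency are ignored:

```python
def loop_seconds(cycles: int, tokens_per_cycle: int, tps: float) -> float:
    """Rough wall-clock time for an agent loop: total tokens / throughput."""
    return cycles * tokens_per_cycle / tps

# 10 think-and-respond cycles of ~180 tokens each:
print(loop_seconds(10, 180, 15))   # 120.0 seconds — about 2 minutes at 15 TPS
print(loop_seconds(10, 180, 180))  # 10.0 seconds at 180 TPS
```

Because agent calls are sequential, the speedup multiplies across every cycle — which is why TPS dominates the feel of the whole tool.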

2. Large RAG Chunks

With Flash Attention and a properly tuned batch size, the prefill[12] on a 10k-token document happens in milliseconds.

This is huge for RAG because:

  • Users don't wait for context to load
  • More context = better answers
  • You can chunk larger documents
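A minimal chunker sketch for feeding those larger documents. The 0.75-words-per-token ratio is a common rough heuristic, not a real tokenizer — swap in your model's tokenizer for exact budgets:

```python
def chunk_words(text: str, max_tokens: int = 10_000,
                words_per_token: float = 0.75) -> list[str]:
    """Split text into chunks that fit a token budget.

    Assumes ~0.75 words per token (a heuristic, not a tokenizer).
    """
    max_words = int(max_tokens * words_per_token)
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

doc = "word " * 20_000          # a 20,000-word stand-in document
chunks = chunk_words(doc, max_tokens=10_000)
print(len(chunks))              # 3 chunks of at most 7,500 words each
```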

3. Real-Time Anything

High TPS makes everything feel native. Low TPS makes everything feel like waiting on an API call.

Think about it: at 150 TPS, a 500-token response takes 3.3 seconds. That feels like reading fast, not like waiting for an LLM.

Model Comparison

Here's how Qwen 3.5 MoE stacks up against the competition:

| Model | Size | Active Params | TPS (RTX 4090) | Notes |
|---|---|---|---|---|
| Qwen 2.5 2B | 2B | 2B | 50-60 | Great baseline |
| Qwen 3.5 MoE | 14B | 2-3B | 150-200 | Speed king |
| Llama 3.2 3B | 3B | 3B | 60-80 | Solid but slower |
| Mistral 7B | 7B | 7B | 40-50 | Classic dense |
| Phi-4 4B | 4B | 4B | 70-90 | Microsoft's offering |

Advanced Tuning Tips

KV Cache Optimization

The KV cache stores precomputed attention keys and values. More cache = faster generation:

```dockerfile
PARAMETER num_ctx 32768
PARAMETER num_keep 32768
```

Batch Size Tuning

If you're running out of memory, lower the batch size:

```dockerfile
PARAMETER num_batch 256  # Half of default
```

If you have VRAM to spare, bump it up:

```dockerfile
PARAMETER num_batch 1024  # Crusher mode
```

Temperature Guidelines

| Use Case | Temperature | Notes |
|---|---|---|
| Coding | 0.0-0.1 | Deterministic, correct |
| RAG Q&A | 0.1-0.2 | Factual, low hallucination |
| Creative writing | 0.7-0.9 | More varied output |
| Humor | 0.8-1.0 | Maximum chaos |
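Under the hood, temperature just divides the logits before softmax: low values sharpen the distribution toward the top token, high values flatten it. A toy sketch (made-up logits; temperature must be > 0 — temperature 0 corresponds to plain argmax):

```python
import math

def softmax_t(logits, temperature):
    """Temperature-scaled softmax: lower T sharpens, higher T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_t(logits, 0.1))  # near-deterministic: top token dominates
print(softmax_t(logits, 0.9))  # flatter: sampling gets more varied
```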

TL;DR

  • 🔄 MoE[1] means different experts, same compute — 14B total params, 2-3B active
  • 🚀 TPS[4] matters more than model size for real-world use
  • ⚡ Flash Attention[9] is mandatory for long contexts — it's 5-10x faster
  • 🎮 RTX 4070+ is the sweet spot for local AI

Wrapping Up

Qwen 3.5 MoE is not perfect. The 2B dense model will always win on pure VRAM efficiency. But if you have the hardware to run it, the MoE version is faster and smarter.

The key is making sure your Ollama config is not the bottleneck. The default settings are there for compatibility — not performance.



❓ Frequently Asked Questions

Q: Why not just use the 2B dense model?

A:

The 2B model is great for VRAM-constrained environments — it runs on 2GB of VRAM and fits in almost any GPU. However, if you have an RTX 4070 or better, the MoE version is strictly better: it has more knowledge (14B params worth) at roughly the same speed and only needs about 4-5GB of VRAM.

Think of it this way: you paid for the 4070, you might as well use it.

Q: Does MoE work on CPU?

A:

Technically yes, but it's not worth it. The routing overhead actually makes MoE slower than dense on CPU. If you're running on CPU only, stick with the 2B or 3B dense models.

The MoE advantage only kicks in with GPU acceleration where the parallel computation pays off.

Q: How much VRAM do I actually need?

A:

Here's the breakdown:

| Model | Quantization | VRAM Needed |
|---|---|---|
| Qwen 2B | Q4_K_M | ~2.5GB |
| Qwen 3.5 MoE | Q4_K_M | ~4-5GB |
| Qwen 3.5 MoE | Q5_K_M | ~6-7GB |
| Qwen 3.5 MoE | F16 | ~14GB |

For most people, Q4_K_M is the sweet spot — almost indistinguishable quality, way less VRAM.

Q: Can I run this on a laptop?

A:

Only if you have a gaming laptop with an RTX 4060 or better. Integrated graphics won't cut it — you need dedicated VRAM.

MacBooks with M3/M4 chips work great thanks to unified memory, but you'll need to allocate at least 18GB of RAM to the model.

Q: What's the difference between Qwen 3.5 MoE and Qwen 2.5?

A:

Qwen 2.5 is the base model family, Qwen 3.5 is the latest generation with improved reasoning. The "MoE" variant uses the Mixture of Experts architecture to get more capability out of the same compute.

In short: Qwen 3.5 MoE = the Qwen 3.5 base model + expert routing.

Q: Is Ollama stable for production?

A:

Ollama is great for development and local testing. For production deployments, consider using llama.cpp server mode or a proper inference server like vLLM. Ollama is basically a wrapper around llama.cpp with a convenient CLI.

For your local AI agent development, Ollama is perfect.

Q: How do I benchmark this properly?

A:

Use the ollama benchmark command or measure with time in your code. Key metrics:

  • TTFT: Use time ollama run qwen35-moe-fast "prompt" and measure to first token
  • TPS: Count tokens in output divided by generation time
  • Actual VRAM: Check nvidia-smi during generation

Run at least 3 iterations and average — cold start times vary.
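If you capture a timestamp for each streamed token, both metrics fall out directly. A sketch — the timing data here is fake, purely to show the arithmetic:

```python
def metrics(token_times: list[float]) -> dict:
    """Compute TTFT and TPS from per-token arrival times (seconds,
    measured from the moment the prompt was sent)."""
    ttft = token_times[0]
    gen_time = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / gen_time if gen_time > 0 else float("inf")
    return {"ttft_s": ttft, "tps": tps}

# Fake stream: first token at 0.2 s, then one token every 10 ms.
times = [0.2 + 0.01 * i for i in range(101)]
print(metrics(times))  # TTFT 0.2 s, ~100 TPS
```

Note that TTFT is excluded from the TPS calculation — mixing prefill latency into throughput is the most common benchmarking mistake.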

Q: What about other MoE models?

A:

Qwen 3.5 MoE is currently the best for local inference. Other options:

  • Mixtral 8x7B: Good but needs ~48GB VRAM (not local-friendly)
  • DeepSeek V3: Competing with Qwen, newer
  • Gemma 2 MoE: Google's offering, still maturing

For local AI on consumer hardware, Qwen 3.5 MoE is the king.


Footnotes

  1. MoE (Mixture of Experts) — A neural network architecture where multiple specialized "expert" networks exist, but only a subset is active for any given input. It's like having 14 specialists in a company, but only the relevant 2-3 show up for each task.

  2. Parameters — The weights and biases in a neural network. Think of them as the "knowledge" the model has learned. More parameters generally means more knowledge, but also more compute cost.

  3. TTFT (Time to First Token) — The latency between sending a prompt and receiving the first generated token. Critical for interactive applications where users expect immediate feedback.

  4. TPS (Tokens Per Second) — The generation throughput after the first token. This determines how fast the model can produce long outputs. Also called "streaming throughput."

  5. Agents — LLM-powered programs that can take multiple actions in sequence, like think → search → synthesize → respond. High TPS is critical because agents make many sequential calls.

  6. Unified Memory — Apple's architecture where CPU and GPU share the same RAM. This is more efficient than discrete GPUs because data doesn't need to cross the memory bus.

  7. num_ctx (Context Length) — How many tokens the model can "see" at once. For RAG, you typically want 16k-32k to load entire documents.

  8. num_batch (Batch Size) — How many tokens are processed simultaneously. Higher = faster prefill but more VRAM. Balance this with your GPU's memory.

  9. Flash Attention — A mathematically equivalent but faster attention mechanism. Uses less memory and computes faster by not materializing the full attention matrix. Essential for long contexts.

  10. Temperature — A sampling parameter that controls randomness. 0 = always pick the most likely next token (deterministic). Higher = more random/creative.

  11. Iterative Agents — Agents that loop multiple times, thinking through a problem step by step. Tools like Claude Code and OpenAI's o1 use this pattern heavily.

  12. Prefill — The phase where the model processes your input before generating output. This is where Flash Attention shines — long inputs load almost instantly.
