

I find llama.cpp with Vulkan EXTREMELY reliable. I can have it running for days at a time without a problem. As far as tokens/sec, that's a complicated question because it depends on the model, quant, speculative decoding, KV cache quant, context length, and how the layers are distributed across cards. Generally:
Typical speeds at deep context for agentic use (simple chats will be faster); an example launch command follows the table:
| Model | Quant | Prompt Processing (tok/s) | Token Generation (tok/s) | Hardware | Quality |
|---|---|---|---|---|---|
| Qwen 3.5 397B | Q2_K_M | 100-120 | 18-22 | 2 x 7900 + 4 x MI50 | ★★★★★ |
| Gemma4 31B or Qwen3.5 27B | Q8_0 | 400-800 | 20-25 | 2 x 7900 XTX | ★★★★ |
| Qwen 3.6 35B | Q5_K_M | 1000-2500 | 60-100 | 2 x 7900 XTX | ★★★★ |
| Qwen 3.5 122B | Q4_0 | 200-300 | 30-35 | 4 x MI50 | ★★★★ |
| gpt-oss 120b | mxfp4 (native) | 500-800 | 50-60 | 3 x MI50 | ★★ |
| Nemotron 3 Nano 30B | IQ3_K_XXS | 2500-3000 | 150-180 | 1 x 7900 XTX | ★ |
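
To make those knobs concrete, here's a minimal sketch of a llama-server launch (Vulkan build) that exercises the levers mentioned above: long context, full GPU offload, a layer split across two cards, quantized KV cache, and an optional draft model for speculative decoding. The model paths are placeholders and exact flag spellings can vary between llama.cpp versions, so check `llama-server --help` on your build before copying this.

```bash
# Hedged sketch, not an exact command from the post above.
# Placeholder model files; swap in whatever GGUFs you actually run.
#   -c 32768          deep context, matching the "agentic use" numbers
#   -ngl 99           offload every layer to the GPUs
#   -sm layer -ts 1,1 split layers evenly across two cards
#   -ctk/-ctv q8_0    quantized KV cache so long context fits in VRAM
#                     (V-cache quant may require flash attention, -fa, on some builds)
#   -md ...           small draft model for speculative decoding (optional)
./llama-server \
  -m  ./models/main-model-Q5_K_M.gguf \
  -md ./models/draft-model-Q4_0.gguf \
  -c 32768 -ngl 99 \
  -sm layer -ts 1,1 \
  -ctk q8_0 -ctv q8_0 \
  --port 8080
```

Dropping the draft model, shrinking the context, or leaving the KV cache at f16 will all shift the numbers in the table, which is why a single tokens/sec figure doesn't mean much on its own.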

I think if we lived in a sane world, this would be a constant discussion in every corner of daily life.