Best local LLMs for the Apple M2 Ultra (128GB) (2026)
The Apple M2 Ultra (128GB) has 96 GB of VRAM and 800 GB/s of memory bandwidth. That fits 25 of our 30 tracked models entirely on the GPU at Q4_K_M and 32K context. Every figure below is computed from weights + KV cache + overhead, not guessed. Open this GPU in the calculator →
All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.
Fit grid by context length
| Model | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Qwen3.5 9B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Qwen3.5 4B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Gemma 3 4B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Mistral Nemo 12B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Phi 4 14B | Q8_0 | Q8_0 | — | — | — |
| Gemma 4 12B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Qwen3 14B | Q8_0 | Q8_0 | Q8_0 | — | — |
| Gemma 3 12B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| DeepSeek R1 Distill Qwen 14B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| GPT OSS 20B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Rocinante 12B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Qwen3.5 27B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Qwen3.5 35B A3B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Qwen3 Coder 30B A3B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Gemma 3 27B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Gemma 4 26B A4B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Mistral Small 3.2 24B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| DeepSeek R1 Distill Qwen 32B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Cydonia 24B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Llama 3.3 70B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q5_K_M |
| Llama 4 Scout | Q6_K | Q6_K | Q6_K | Q5_K_M | Q4_K_M |
| GPT OSS 120B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| GLM 4.5 Air | Q6_K | Q5_K_M | Q5_K_M | Q5_K_M | Q4_K_M |
| Qwen3.5 122B A10B | Q5_K_M | Q5_K_M | Q5_K_M | Q5_K_M | Q4_K_M |
| Qwen3.5 397B A17B | — | — | — | — | — |
| DeepSeek R1 | — | — | — | — | — |
| Mistral Large 3 | — | — | — | — | — |
| Kimi K2.5 | — | — | — | — | — |
| Anubis 70B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q5_K_M |
Fits on GPUExpert offloadPartial offloadCPU only
Top pick per use case
Coding · 32K
Qwen3 Coder 30B A3B Q8_0
Fits on GPU · ≈ 90 tok/s
Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.
Roleplay & writing · 16K
Rocinante 12B Q8_0
Fits on GPU · ≈ 40 tok/s
The budget roleplay king. Lowest slop-per-token of anything under 24 GB; the community keeps it alive for a reason.
Summarization · 32K
Qwen3.5 27B Q8_0
Fits on GPU · ≈ 18 tok/s
The dense 24GB workhorse. If you want one model on a 3090 and no surprises, it's this.
RAG & documents · 16K
Qwen3.5 35B A3B Q8_0
Fits on GPU · ≈ 87 tok/s
The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.
Vision / image input · 16K
Gemma 3 27B Q8_0
Fits on GPU · ≈ 18 tok/s
Sliding-window attention keeps long context cheap, and the vision stack still beats most of 2026's newcomers.
Almost fits
These models can't run well on 96 GB at 32K: Phi 4 14B, Qwen3.5 397B A17B, DeepSeek R1, Mistral Large 3, Kimi K2.5.
Frequently asked questions
What is the best local LLM for a Apple M2 Ultra (128GB) in 2026?
Qwen3 Coder 30B A3B is our top overall pick on the Apple M2 Ultra (128GB): Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.
How many local LLMs fit in 96 GB of VRAM?
At Q4_K_M quantization and 32K context, 25 of our 30 tracked models fit entirely in the Apple M2 Ultra (128GB)'s 96 GB of VRAM, and 0 more MoE models run via expert offload with enough system RAM.
Can a Apple M2 Ultra (128GB) run a 70B model like Llama 3.3?
Yes — Llama 3.3 70B runs on the Apple M2 Ultra (128GB) as "Fits on GPU" at Q8_0, around 7 tokens/sec.
Can a Apple M2 Ultra (128GB) run DeepSeek R1?
Not the full 671B model — its Q2_K weights alone exceed 200 GB. The R1-Distill-Qwen 14B/32B models are the practical local alternative on this card.
How much VRAM do I need for 32K context?
The KV cache is separate from the weights and grows linearly with context. For a typical 8-14B dense model at 32K and f16 KV, budget 2-4 GB extra on top of the weights; MLA models like DeepSeek R1 need far less, and quantized KV (q8_0) halves it.