dexiio

Best local LLMs for the Apple M2 Ultra (128GB) (2026)

The Apple M2 Ultra (128GB) has 96 GB of VRAM and 800 GB/s of memory bandwidth. That fits 25 of our 30 tracked models entirely on the GPU at Q4_K_M and 32K context. Every figure below is computed from weights + KV cache + overhead, not guessed. Open this GPU in the calculator →

All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.

Fit grid by context length

Model8K16K32K64K128K
Llama 3.1 8BQ8_0Q8_0Q8_0Q8_0Q8_0
Qwen3.5 9BQ8_0Q8_0Q8_0Q8_0Q8_0
Qwen3.5 4BQ8_0Q8_0Q8_0Q8_0Q8_0
Gemma 3 4BQ8_0Q8_0Q8_0Q8_0Q8_0
Mistral Nemo 12BQ8_0Q8_0Q8_0Q8_0Q8_0
Phi 4 14BQ8_0Q8_0
Gemma 4 12BQ8_0Q8_0Q8_0Q8_0Q8_0
Qwen3 14BQ8_0Q8_0Q8_0
Gemma 3 12BQ8_0Q8_0Q8_0Q8_0Q8_0
DeepSeek R1 Distill Qwen 14BQ8_0Q8_0Q8_0Q8_0Q8_0
GPT OSS 20BQ8_0Q8_0Q8_0Q8_0Q8_0
Rocinante 12BQ8_0Q8_0Q8_0Q8_0Q8_0
Qwen3.5 27BQ8_0Q8_0Q8_0Q8_0Q8_0
Qwen3.5 35B A3BQ8_0Q8_0Q8_0Q8_0Q8_0
Qwen3 Coder 30B A3BQ8_0Q8_0Q8_0Q8_0Q8_0
Gemma 3 27BQ8_0Q8_0Q8_0Q8_0Q8_0
Gemma 4 26B A4BQ8_0Q8_0Q8_0Q8_0Q8_0
Mistral Small 3.2 24BQ8_0Q8_0Q8_0Q8_0Q8_0
DeepSeek R1 Distill Qwen 32BQ8_0Q8_0Q8_0Q8_0Q8_0
Cydonia 24BQ8_0Q8_0Q8_0Q8_0Q8_0
Llama 3.3 70BQ8_0Q8_0Q8_0Q8_0Q5_K_M
Llama 4 ScoutQ6_KQ6_KQ6_KQ5_K_MQ4_K_M
GPT OSS 120BQ8_0Q8_0Q8_0Q8_0Q8_0
GLM 4.5 AirQ6_KQ5_K_MQ5_K_MQ5_K_MQ4_K_M
Qwen3.5 122B A10BQ5_K_MQ5_K_MQ5_K_MQ5_K_MQ4_K_M
Qwen3.5 397B A17B
DeepSeek R1
Mistral Large 3
Kimi K2.5
Anubis 70BQ8_0Q8_0Q8_0Q8_0Q5_K_M

Fits on GPUExpert offloadPartial offloadCPU only

Top pick per use case

Coding · 32K

Qwen3 Coder 30B A3B Q8_0

Fits on GPU · ≈ 90 tok/s

Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.

Roleplay & writing · 16K

Rocinante 12B Q8_0

Fits on GPU · ≈ 40 tok/s

The budget roleplay king. Lowest slop-per-token of anything under 24 GB; the community keeps it alive for a reason.

Summarization · 32K

Qwen3.5 27B Q8_0

Fits on GPU · ≈ 18 tok/s

The dense 24GB workhorse. If you want one model on a 3090 and no surprises, it's this.

RAG & documents · 16K

Qwen3.5 35B A3B Q8_0

Fits on GPU · ≈ 87 tok/s

The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.

Vision / image input · 16K

Gemma 3 27B Q8_0

Fits on GPU · ≈ 18 tok/s

Sliding-window attention keeps long context cheap, and the vision stack still beats most of 2026's newcomers.

Almost fits

These models can't run well on 96 GB at 32K: Phi 4 14B, Qwen3.5 397B A17B, DeepSeek R1, Mistral Large 3, Kimi K2.5.

Frequently asked questions

What is the best local LLM for a Apple M2 Ultra (128GB) in 2026?

Qwen3 Coder 30B A3B is our top overall pick on the Apple M2 Ultra (128GB): Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.

How many local LLMs fit in 96 GB of VRAM?

At Q4_K_M quantization and 32K context, 25 of our 30 tracked models fit entirely in the Apple M2 Ultra (128GB)'s 96 GB of VRAM, and 0 more MoE models run via expert offload with enough system RAM.

Can a Apple M2 Ultra (128GB) run a 70B model like Llama 3.3?

Yes — Llama 3.3 70B runs on the Apple M2 Ultra (128GB) as "Fits on GPU" at Q8_0, around 7 tokens/sec.

Can a Apple M2 Ultra (128GB) run DeepSeek R1?

Not the full 671B model — its Q2_K weights alone exceed 200 GB. The R1-Distill-Qwen 14B/32B models are the practical local alternative on this card.

How much VRAM do I need for 32K context?

The KV cache is separate from the weights and grows linearly with context. For a typical 8-14B dense model at 32K and f16 KV, budget 2-4 GB extra on top of the weights; MLA models like DeepSeek R1 need far less, and quantized KV (q8_0) halves it.

Related guides