dexiio

Best local LLMs for the AMD Radeon RX 7900 XT (2026)

The AMD Radeon RX 7900 XT has 20 GB of VRAM and 800 GB/s of memory bandwidth. That is enough to fit 11 of our 30 tracked models entirely on the GPU at Q4_K_M and 32K context, and to run 6 more via MoE expert offload. Every figure below is computed from weights + KV cache + overhead, not guessed.

All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.
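As a rough illustration of the weights + KV cache + overhead budget, the sketch below checks whether a model fits in VRAM. The bits-per-weight values are approximate averages for GGUF quantization types, and `overhead_gb` and the function names are hypothetical placeholders, not the calculator's actual constants.

```python
# Approximate average bits per weight for common GGUF quantization types.
# These are assumed round figures; real files vary by model architecture.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "IQ4_XS": 4.25,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB (1 GB = 1e9 bytes) for a parameter count in billions."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits_on_gpu(
    params_billion: float,
    quant: str,
    kv_cache_gb: float,
    vram_gb: float = 20.0,            # RX 7900 XT
    display_reserve_gb: float = 0.6,  # display reserve stated in the text
    overhead_gb: float = 1.0,         # hypothetical runtime overhead
) -> bool:
    """Return True if weights + KV cache + overhead fit in the usable VRAM."""
    needed = weights_gb(params_billion, quant) + kv_cache_gb + overhead_gb
    return needed <= vram_gb - display_reserve_gb


# An 8B model at Q8_0 with a 4 GB KV cache, versus a 70B model at IQ4_XS.
print(fits_on_gpu(8.0, "Q8_0", kv_cache_gb=4.0))
print(fits_on_gpu(70.0, "IQ4_XS", kv_cache_gb=4.0))
```

A model that fails this check is not necessarily unusable; it falls into one of the offload tiers instead.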

Fit grid by context length

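| Model | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | IQ4_XS |
| Qwen3.5 9B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | IQ4_XS |
| Qwen3.5 4B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | IQ4_XS |
| Gemma 3 4B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Mistral Nemo 12B | Q8_0 | Q8_0 | Q8_0 | Q4_K_M | Q8_0 |
| Phi 4 14B | Q8_0 | Q8_0 | - | - | - |
| Gemma 4 12B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q5_K_M |
| Qwen3 14B | Q8_0 | Q8_0 | Q6_K | - | - |
| Gemma 3 12B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q5_K_M |
| DeepSeek R1 Distill Qwen 14B | Q8_0 | Q8_0 | Q6_K | Q4_K_M | Q8_0 |
| GPT OSS 20B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Rocinante 12B | Q8_0 | Q8_0 | Q8_0 | Q4_K_M | Q8_0 |
| Qwen3.5 27B | Q4_K_M | IQ4_XS | IQ4_XS | IQ4_XS | Q5_K_M |
| Qwen3.5 35B A3B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Qwen3 Coder 30B A3B | Q4_K_M | IQ4_XS | Q8_0 | Q8_0 | Q8_0 |
| Gemma 3 27B | Q4_K_M | Q4_K_M | IQ4_XS | IQ4_XS | IQ4_XS |
| Gemma 4 26B A4B | Q4_K_M | IQ4_XS | Q8_0 | Q6_K | Q5_K_M |
| Mistral Small 3.2 24B | Q5_K_M | Q5_K_M | IQ4_XS | IQ4_XS | Q8_0 |
| DeepSeek R1 Distill Qwen 32B | Q4_K_M | Q4_K_M | Q4_K_M | Q4_K_M | Q4_K_M |
| Cydonia 24B | Q5_K_M | Q5_K_M | Q4_K_M | Q4_K_M | Q8_0 |
| Llama 3.3 70B | IQ4_XS | IQ4_XS | IQ4_XS | - | - |
| Llama 4 Scout | Q4_K_M | Q4_K_M | Q4_K_M | IQ4_XS | - |
| GPT OSS 120B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| GLM 4.5 Air | IQ4_XS | IQ4_XS | IQ4_XS | IQ4_XS | - |
| Qwen3.5 122B A10B | Q4_K_M | Q4_K_M | Q4_K_M | Q4_K_M | - |
| Qwen3.5 397B A17B | - | - | - | - | - |
| DeepSeek R1 | - | - | - | - | - |
| Mistral Large 3 | - | - | - | - | - |
| Kimi K2.5 | - | - | - | - | - |
| Anubis 70B | IQ4_XS | IQ4_XS | IQ4_XS | - | - |

A dash marks a cell with no quantization listed.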

Legend: Fits on GPU · Expert offload · Partial offload · CPU only

Top pick per use case

Coding · 32K

Qwen3 Coder 30B A3B Q8_0

Expert offload · ≈ 14 tok/s

Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.

Roleplay & writing · 16K

Rocinante 12B Q8_0

Fits on GPU · ≈ 40 tok/s

The budget roleplay king. Lowest slop-per-token of anything under 24 GB; the community keeps it alive for a reason.

Summarization · 32K

Gemma 3 27B IQ4_XS

Fits on GPU · ≈ 35 tok/s

Sliding-window attention keeps long context cheap, and the vision stack still beats most of 2026's newcomers.

RAG & documents · 16K

Qwen3.5 35B A3B Q8_0

Expert offload · ≈ 13 tok/s

The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.

Vision / image input · 16K

Gemma 3 27B Q4_K_M

Fits on GPU · ≈ 31 tok/s

Sliding-window attention keeps long context cheap, and the vision stack still beats most of 2026's newcomers.
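The page does not say how its tok/s figures are derived. One common back-of-envelope approach, sketched below for a model that sits entirely on the GPU, treats decoding as memory-bandwidth-bound: each generated token reads the weights and the KV cache once. The `efficiency` factor and the function name are hypothetical, and the example sizes are illustrative rather than taken from the calculator.

```python
def decode_tokens_per_second(
    bandwidth_gb_s: float,
    weights_gb: float,
    kv_cache_gb: float = 0.0,
    efficiency: float = 0.6,  # hypothetical fraction of peak bandwidth achieved
) -> float:
    """Bandwidth-bound decode estimate: bytes available per second / bytes read per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    gb_read_per_token = weights_gb + kv_cache_gb
    return efficiency * bandwidth_gb_s / gb_read_per_token


# Illustrative only: a 13 GB dense model with a 2 GB KV cache at 800 GB/s.
print(round(decode_tokens_per_second(800.0, weights_gb=13.0, kv_cache_gb=2.0), 1))
```

For MoE models only the active experts are read per token, so `weights_gb` would be the active portion, and offloaded experts are limited by system RAM bandwidth rather than the GPU's.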

Almost fits

These models can't run well on 20 GB at 32K: Phi 4 14B, Qwen3.5 27B, DeepSeek R1 Distill Qwen 32B, Cydonia 24B, Llama 3.3 70B, Qwen3.5 122B A10B and 5 more.

What an upgrade unlocks

Stepping up to an Apple M2 Ultra (128GB) (96 GB) unlocks 6 more models on GPU or expert offload at 32K, including Qwen3.5 27B, DeepSeek R1 Distill Qwen 32B, and Cydonia 24B.

Frequently asked questions

What is the best local LLM for an AMD Radeon RX 7900 XT in 2026?

Qwen3 Coder 30B A3B is our top overall pick on the AMD Radeon RX 7900 XT. It is a purpose-built agentic coder with the best local fill-in-the-middle and tool-calling under 70B, though it is useless at small talk.

How many local LLMs fit in 20 GB of VRAM?

At Q4_K_M quantization and 32K context, 11 of our 30 tracked models fit entirely in the AMD Radeon RX 7900 XT's 20 GB of VRAM, and 6 more MoE models run via expert offload with enough system RAM.

Can an AMD Radeon RX 7900 XT run a 70B model like Llama 3.3?

Yes — Llama 3.3 70B runs on the AMD Radeon RX 7900 XT as "Partial offload" at IQ4_XS.

Can an AMD Radeon RX 7900 XT run DeepSeek R1?

Not the full 671B model — its Q2_K weights alone exceed 200 GB. The R1-Distill-Qwen 14B/32B models are the practical local alternative on this card.

How much VRAM do I need for 32K context?

The KV cache is separate from the weights and grows linearly with context. For a typical 8-14B dense model at 32K and f16 KV, budget 2-4 GB extra on top of the weights; MLA models like DeepSeek R1 need far less, and quantized KV (q8_0) halves it.
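The linear growth can be made concrete with a minimal sketch. It assumes standard grouped-query attention with full attention on every layer (so it does not cover sliding-window or MLA models), and uses the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) as the example. The function name is hypothetical, and the 1.0625 bytes per element for q8_0 reflects its block layout of 32 int8 values plus one 16-bit scale.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_element: float = 2.0,  # f16
) -> float:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


GIB = 2**30

# Llama 3.1 8B shape at 32K context.
f16 = kv_cache_bytes(32, 8, 128, 32 * 1024)
q8 = kv_cache_bytes(32, 8, 128, 32 * 1024, bytes_per_element=1.0625)

print(f16 / GIB)  # 4.0
print(q8 / GIB)   # 2.125
```

Doubling `context_tokens` doubles the result, which is why the same model can drop a quantization tier or leave the GPU entirely at 64K or 128K.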
