Best local LLMs for the AMD Radeon RX 7900 XTX (2026)

The AMD Radeon RX 7900 XTX has 24 GB of VRAM and 960 GB/s of memory bandwidth. That fits 15 of our 30 tracked models entirely on the GPU at Q4_K_M and 32K context, and 4 more via MoE expert offload. Every figure below is computed from weights + KV cache + overhead, not guessed. Open this GPU in the calculator →

All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.

Fit grid by context length

Model	8K	16K	32K	64K	128K
Llama 3.1 8B	Q8_0	Q8_0	Q8_0	Q8_0	Q4_K_M
Qwen3.5 9B	Q8_0	Q8_0	Q8_0	Q8_0	IQ4_XS
Qwen3.5 4B	Q8_0	Q8_0	Q8_0	Q8_0	Q8_0
Gemma 3 4B	Q8_0	Q8_0	Q8_0	Q8_0	Q8_0
Mistral Nemo 12B	Q8_0	Q8_0	Q8_0	Q6_K	Q4_K_M
Phi 4 14B	Q8_0	Q8_0	—	—	—
Gemma 4 12B	Q8_0	Q8_0	Q8_0	Q8_0	Q8_0
Qwen3 14B	Q8_0	Q8_0	Q8_0	—	—
Gemma 3 12B	Q8_0	Q8_0	Q8_0	Q8_0	Q8_0
DeepSeek R1 Distill Qwen 14B	Q8_0	Q8_0	Q8_0	Q4_K_M	Q8_0
GPT OSS 20B	Q8_0	Q8_0	Q8_0	Q8_0	Q8_0
Rocinante 12B	Q8_0	Q8_0	Q8_0	Q6_K	Q4_K_M
Qwen3.5 27B	Q5_K_M	Q4_K_M	IQ4_XS	IQ4_XS	Q5_K_M
Qwen3.5 35B A3B	Q4_K_M	Q4_K_M	Q8_0	Q8_0	Q8_0
Qwen3 Coder 30B A3B	Q5_K_M	Q5_K_M	Q4_K_M	IQ4_XS	Q8_0
Gemma 3 27B	Q6_K	Q5_K_M	Q5_K_M	Q4_K_M	IQ4_XS
Gemma 4 26B A4B	Q5_K_M	Q5_K_M	IQ4_XS	Q8_0	Q5_K_M
Mistral Small 3.2 24B	Q6_K	Q6_K	Q5_K_M	IQ4_XS	IQ4_XS
DeepSeek R1 Distill Qwen 32B	Q4_K_M	Q4_K_M	Q4_K_M	Q4_K_M	Q4_K_M
Cydonia 24B	Q6_K	Q6_K	Q5_K_M	Q4_K_M	Q5_K_M
Llama 3.3 70B	IQ4_XS	IQ4_XS	IQ4_XS	Q3_K_M	—
Llama 4 Scout	Q4_K_M	Q4_K_M	Q4_K_M	Q4_K_M	—
GPT OSS 120B	Q8_0	Q8_0	Q8_0	Q8_0	Q8_0
GLM 4.5 Air	IQ4_XS	IQ4_XS	IQ4_XS	IQ4_XS	—
Qwen3.5 122B A10B	Q4_K_M	Q4_K_M	Q4_K_M	Q4_K_M	—
Qwen3.5 397B A17B	—	—	—	—	—
DeepSeek R1	—	—	—	—	—
Mistral Large 3	—	—	—	—	—
Kimi K2.5	—	—	—	—	—
Anubis 70B	IQ4_XS	IQ4_XS	IQ4_XS	Q3_K_M	—

Fits on GPUExpert offloadPartial offloadCPU only

Top pick per use case

Coding · 32K

Qwen3 Coder 30B A3B Q4_K_M

Fits on GPU · ≈ 189 tok/s

Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.

Roleplay & writing · 16K

Rocinante 12B Q8_0

Fits on GPU · ≈ 48 tok/s

The budget roleplay king. Lowest slop-per-token of anything under 24 GB; the community keeps it alive for a reason.

Summarization · 32K

Gemma 3 27B Q5_K_M

Fits on GPU · ≈ 32 tok/s

Sliding-window attention keeps long context cheap, and the vision stack still beats most of 2026's newcomers.

RAG & documents · 16K

Qwen3.5 35B A3B Q4_K_M

Fits on GPU · ≈ 175 tok/s

The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.

Vision / image input · 16K

Gemma 3 27B Q5_K_M

Fits on GPU · ≈ 32 tok/s

Sliding-window attention keeps long context cheap, and the vision stack still beats most of 2026's newcomers.

Almost fits

These models can't run well on 24 GB at 32K: Phi 4 14B, DeepSeek R1 Distill Qwen 32B, Llama 3.3 70B, Qwen3.5 122B A10B, Qwen3.5 397B A17B, DeepSeek R1 and 3 more.

What an upgrade unlocks

Stepping up to a Apple M2 Ultra (128GB) (96 GB) unlocks 4 more models on GPU or expert offload at 32K, including DeepSeek R1 Distill Qwen 32B, Llama 3.3 70B, Qwen3.5 122B A10B.

Frequently asked questions

What is the best local LLM for a AMD Radeon RX 7900 XTX in 2026?

Qwen3 Coder 30B A3B is our top overall pick on the AMD Radeon RX 7900 XTX: Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.

How many local LLMs fit in 24 GB of VRAM?

At Q4_K_M quantization and 32K context, 15 of our 30 tracked models fit entirely in the AMD Radeon RX 7900 XTX's 24 GB of VRAM, and 4 more MoE models run via expert offload with enough system RAM.

Can a AMD Radeon RX 7900 XTX run a 70B model like Llama 3.3?

Yes — Llama 3.3 70B runs on the AMD Radeon RX 7900 XTX as "Partial offload" at IQ4_XS.

Can a AMD Radeon RX 7900 XTX run DeepSeek R1?

Not the full 671B model — its Q2_K weights alone exceed 200 GB. The R1-Distill-Qwen 14B/32B models are the practical local alternative on this card.

How much VRAM do I need for 32K context?

The KV cache is separate from the weights and grows linearly with context. For a typical 8-14B dense model at 32K and f16 KV, budget 2-4 GB extra on top of the weights; MLA models like DeepSeek R1 need far less, and quantized KV (q8_0) halves it.

Best local LLMs for the AMD Radeon RX 7900 XTX (2026)

Fit grid by context length

Top pick per use case

Qwen3 Coder 30B A3B Q4_K_M

Rocinante 12B Q8_0

Gemma 3 27B Q5_K_M

Qwen3.5 35B A3B Q4_K_M

Gemma 3 27B Q5_K_M

Almost fits

What an upgrade unlocks

Frequently asked questions

What is the best local LLM for a AMD Radeon RX 7900 XTX in 2026?

How many local LLMs fit in 24 GB of VRAM?

Can a AMD Radeon RX 7900 XTX run a 70B model like Llama 3.3?

Can a AMD Radeon RX 7900 XTX run DeepSeek R1?

How much VRAM do I need for 32K context?

Related guides