dexiio

Best local LLM for summarization on a AMD Radeon RX 7900 XT (2026)

All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.

The verdict

Gemma 3 27B IQ4_XS

Fits on GPU at 32K · Summarization score 8/10 · ≈ 35 tok/s

Sliding-window attention keeps long context cheap, and the vision stack still beats most of 2026's newcomers.

llama-server -m gemma-3-27b-it-IQ4_XS.gguf -c 32768 --flash-attn -ngl 99

Worthy alternates

Qwen3.5 35B A3B Q8_0

Expert offload · ≈ 13 tok/s · Summarization 8/10

The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.

Gemma 4 26B A4B Q8_0

Expert offload · ≈ 12 tok/s · Summarization 8/10

Google's fast MoE with native audio in. Nearly all of its weight sits in routed experts, so expert offload runs it comfortably on 12 GB cards.

Tune this for your exact RAM and settings in the calculator → · All models on the AMD Radeon RX 7900 XT

Frequently asked questions

What is the best local LLM for summarization on a AMD Radeon RX 7900 XT?

Gemma 3 27B at IQ4_XS — it scores 8/10 for summarization and runs as "Fits on GPU" at 32K context on the AMD Radeon RX 7900 XT.

How much context do I need for summarization?

We recommend 48K tokens for summarization (minimum 24K). These picks are computed at 32K.

How fast will it run on a AMD Radeon RX 7900 XT?

Roughly 35 tokens/sec for Gemma 3 27B — comfortable for interactive use.

Do I need more than 20 GB of VRAM for summarization?

No — the pick above needs 18.9 GB of VRAM at 32K.

What settings should I use?

Start with our command: llama-server -m gemma-3-27b-it-IQ4_XS.gguf -c 32768 --flash-attn -ngl 99 — then tune context and KV quant in the fit calculator.