dexiio

Best local LLM for rag & documents on a NVIDIA GeForce RTX 4070 Ti SUPER (2026)

All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.

The verdict

Qwen3.5 35B A3B Q8_0

Expert offload at 16K · RAG & documents score 9/10 · ≈ 13 tok/s · needs 36 GB system RAM

The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.

llama-server -m Qwen3.5-35B-A3B-Q8_0.gguf -c 16384 --flash-attn -ngl 99 --n-cpu-moe 40

Worthy alternates

Qwen3.5 9B Q8_0

Fits on GPU · ≈ 46 tok/s · RAG & documents 8/10

The new small default. Frontier-distilled, natively multimodal, and embarrassingly good for 6 GB of weights.

Gemma 4 26B A4B Q8_0

Expert offload · ≈ 12 tok/s · RAG & documents 8/10

Google's fast MoE with native audio in. Nearly all of its weight sits in routed experts, so expert offload runs it comfortably on 12 GB cards.

Tune this for your exact RAM and settings in the calculator → · All models on the NVIDIA GeForce RTX 4070 Ti SUPER

Frequently asked questions

What is the best local LLM for rag & documents on a NVIDIA GeForce RTX 4070 Ti SUPER?

Qwen3.5 35B A3B at Q8_0 — it scores 9/10 for rag & documents and runs as "Expert offload" at 16K context on the NVIDIA GeForce RTX 4070 Ti SUPER.

How much context do I need for rag & documents?

We recommend 24K tokens for rag & documents (minimum 12K). These picks are computed at 16K.

How fast will it run on a NVIDIA GeForce RTX 4070 Ti SUPER?

Roughly 13 tokens/sec for Qwen3.5 35B A3B — usable for interactive use.

Do I need more than 16 GB of VRAM for rag & documents?

No — the pick above needs 5.5 GB of VRAM plus 36 GB of system RAM at 16K.

What settings should I use?

Start with our command: llama-server -m Qwen3.5-35B-A3B-Q8_0.gguf -c 16384 --flash-attn -ngl 99 --n-cpu-moe 40 — then tune context and KV quant in the fit calculator.