Best local LLMs for the NVIDIA GeForce RTX 4070 (2026)
The NVIDIA GeForce RTX 4070 has 12 GB of VRAM and 504 GB/s of memory bandwidth. That fits 6 of our 30 tracked models entirely on the GPU at Q4_K_M and 32K context, and 6 more via MoE expert offload. Every figure below is computed from weights + KV cache + overhead, not guessed.
All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.
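The "weights + KV cache + overhead" budget can be sketched as a simple check. This is a minimal illustration, not the calculator's actual code: the function names, the tier rules, and the sample sizes in the usage lines are assumptions made for the example. Only the 12 GB of VRAM, the 0.6 GB display reserve, and the 64 GB of system RAM come from the text above.

```python
def gpu_budget_gb(vram_gb: float = 12.0, display_reserve_gb: float = 0.6) -> float:
    """VRAM left for the model once the display reserve is set aside."""
    return vram_gb - display_reserve_gb


def classify_fit(
    weights_gb: float,
    kv_cache_gb: float,
    overhead_gb: float,
    expert_weights_gb: float = 0.0,  # routed-expert share of the weights (MoE only)
    system_ram_gb: float = 64.0,
) -> str:
    """Hypothetical tier rule: sum the three terms and compare to the budget."""
    budget = gpu_budget_gb()
    total = weights_gb + kv_cache_gb + overhead_gb

    if total <= budget:
        return "Fits on GPU"

    # Assumed MoE rule: routed experts move to system RAM, everything else
    # (attention, shared layers, KV cache, overhead) stays on the GPU.
    if expert_weights_gb > 0:
        gpu_resident = total - expert_weights_gb
        if gpu_resident <= budget and expert_weights_gb <= system_ram_gb:
            return "Expert offload"

    # The split between these two tiers is not specified in the text.
    return "Partial offload or CPU only"


# Illustrative inputs only, not measured model sizes.
print(classify_fit(weights_gb=5.7, kv_cache_gb=4.3, overhead_gb=0.5))
print(classify_fit(weights_gb=36.0, kv_cache_gb=1.0, overhead_gb=1.0,
                   expert_weights_gb=33.0))
```

With these made-up inputs the first call lands in the "Fits on GPU" tier and the second in "Expert offload", because only the non-expert share has to fit inside the 11.4 GB budget.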
Fit grid by context length
| Model | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q8_0 | Q8_0 | Q5_K_M | IQ4_XS | Q8_0 |
| Qwen3.5 9B | Q8_0 | Q6_K | Q4_K_M | IQ4_XS | Q8_0 |
| Qwen3.5 4B | Q8_0 | Q8_0 | Q8_0 | IQ4_XS | Q8_0 |
| Gemma 3 4B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Mistral Nemo 12B | Q5_K_M | Q4_K_M | IQ4_XS | Q8_0 | Q8_0 |
| Phi 4 14B | Q4_K_M | Q4_K_M | — | — | — |
| Gemma 4 12B | Q6_K | Q5_K_M | Q4_K_M | IQ4_XS | Q8_0 |
| Qwen3 14B | Q4_K_M | IQ4_XS | IQ4_XS | — | — |
| Gemma 3 12B | Q6_K | Q5_K_M | Q4_K_M | IQ4_XS | Q8_0 |
| DeepSeek R1 Distill Qwen 14B | Q4_K_M | Q4_K_M | Q4_K_M | Q8_0 | Q8_0 |
| GPT OSS 20B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Rocinante 12B | Q5_K_M | Q4_K_M | IQ4_XS | Q8_0 | Q8_0 |
| Qwen3.5 27B | IQ4_XS | IQ4_XS | IQ4_XS | Q8_0 | Q5_K_M |
| Qwen3.5 35B A3B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Qwen3 Coder 30B A3B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q8_0 |
| Gemma 3 27B | IQ4_XS | IQ4_XS | IQ4_XS | IQ4_XS | Q8_0 |
| Gemma 4 26B A4B | Q8_0 | Q8_0 | Q6_K | Q8_0 | Q5_K_M |
| Mistral Small 3.2 24B | IQ4_XS | IQ4_XS | IQ4_XS | Q8_0 | Q8_0 |
| DeepSeek R1 Distill Qwen 32B | Q4_K_M | Q4_K_M | Q4_K_M | Q8_0 | Q4_K_M |
| Cydonia 24B | Q4_K_M | Q4_K_M | Q4_K_M | Q8_0 | Q8_0 |
| Llama 3.3 70B | IQ4_XS | IQ4_XS | Q4_K_M | — | — |
| Llama 4 Scout | Q4_K_M | Q4_K_M | Q4_K_M | — | — |
| GPT OSS 120B | Q8_0 | Q8_0 | Q8_0 | Q8_0 | — |
| GLM 4.5 Air | IQ4_XS | IQ4_XS | Q3_K_M | — | — |
| Qwen3.5 122B A10B | — | — | — | — | — |
| Qwen3.5 397B A17B | — | — | — | — | — |
| DeepSeek R1 | — | — | — | — | — |
| Mistral Large 3 | — | — | — | — | — |
| Kimi K2.5 | — | — | — | — | — |
| Anubis 70B | IQ4_XS | IQ4_XS | Q4_K_M | — | — |
Legend: Fits on GPU · Expert offload · Partial offload · CPU only
Top pick per use case
Coding · 32K
Qwen3 Coder 30B A3B Q8_0
Expert offload · ≈ 14 tok/s
Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.
Roleplay & writing · 16K
Rocinante 12B Q4_K_M
Fits on GPU · ≈ 44 tok/s
The budget roleplay king. Lowest slop-per-token of anything under 24 GB; the community keeps it alive for a reason.
Summarization · 32K
Qwen3.5 35B A3B Q8_0
Expert offload · ≈ 13 tok/s
The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.
RAG & documents · 16K
Qwen3.5 35B A3B Q8_0
Expert offload · ≈ 13 tok/s
The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.
Vision / image input · 16K
Gemma 4 26B A4B Q8_0
Expert offload · ≈ 12 tok/s
Google's fast MoE with native audio in. Nearly all of its weight sits in routed experts, so expert offload runs it comfortably on 12 GB cards.
Almost fits
These models can't run well on 12 GB at 32K: Mistral Nemo 12B, Phi 4 14B, Qwen3 14B, DeepSeek R1 Distill Qwen 14B, Rocinante 12B, Qwen3.5 27B and 12 more.
What an upgrade unlocks
Stepping up to an Apple M2 Ultra (128GB) (96 GB) unlocks 13 more models on GPU or expert offload at 32K, including Mistral Nemo 12B, Qwen3 14B, DeepSeek R1 Distill Qwen 14B.
Frequently asked questions
What is the best local LLM for an NVIDIA GeForce RTX 4070 in 2026?
Qwen3 Coder 30B A3B is our top overall pick on the NVIDIA GeForce RTX 4070: Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.
How many local LLMs fit in 12 GB of VRAM?
At Q4_K_M quantization and 32K context, 6 of our 30 tracked models fit entirely in the NVIDIA GeForce RTX 4070's 12 GB of VRAM, and 6 more MoE models run via expert offload with enough system RAM.
Can an NVIDIA GeForce RTX 4070 run a 70B model like Llama 3.3?
Yes, but slowly: Llama 3.3 70B runs on the NVIDIA GeForce RTX 4070 as "CPU only" at Q4_K_M, at around 1 token/sec.
Can an NVIDIA GeForce RTX 4070 run DeepSeek R1?
Not the full 671B model — its Q2_K weights alone exceed 200 GB. The R1-Distill-Qwen 14B/32B models are the practical local alternative on this card.
How much VRAM do I need for 32K context?
The KV cache is separate from the weights and grows linearly with context. For a typical 8-14B dense model at 32K and f16 KV, budget 2-4 GB extra on top of the weights; MLA models like DeepSeek R1 need far less, and quantized KV (q8_0) halves it.
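The linear growth can be checked with the standard KV cache size formula for grouped-query attention. This is a rough sketch: the function name is made up for the example, the layer and head counts are the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128), and q8_0 is approximated as 1 byte per element, which ignores its small per-block scale overhead.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_elem: float = 2.0,  # f16 = 2 bytes; q8_0 is roughly 1 byte
) -> float:
    """Size of the KV cache: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128.
llama_8b = dict(n_layers=32, n_kv_heads=8, head_dim=128)

f16_16k = kv_cache_bytes(**llama_8b, context_len=16 * 1024)
f16_32k = kv_cache_bytes(**llama_8b, context_len=32 * 1024)
q8_32k = kv_cache_bytes(**llama_8b, context_len=32 * 1024, bytes_per_elem=1.0)

print(f"f16 @ 16K: {f16_16k / GIB:.1f} GiB")  # 2.0 GiB
print(f"f16 @ 32K: {f16_32k / GIB:.1f} GiB")  # 4.0 GiB
print(f"q8  @ 32K: {q8_32k / GIB:.1f} GiB")   # 2.0 GiB
```

Doubling the context doubles the cache, and dropping from f16 to an 8-bit cache roughly halves it, which is where the 2-4 GB budget at 32K comes from. Models with fewer KV heads or fewer layers land lower in that range.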