dexiio

Best local LLMs for the NVIDIA GeForce RTX 3060 Ti (2026)

The NVIDIA GeForce RTX 3060 Ti has 8 GB of VRAM and 448 GB/s of memory bandwidth. That is enough to fit 1 of our 30 tracked models entirely on the GPU at Q4_K_M and 32K context, with 4 more running via MoE expert offload. Every figure below is computed from weights + KV cache + overhead, not guessed.

All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.
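The budget described above can be sketched as a simple sum. This is a minimal illustration, not the calculator's actual code: the function name `fits_on_gpu`, the 0.5 GB runtime overhead default, and the example model sizes are assumptions made for illustration. Only the 8 GB of VRAM and the 0.6 GB display reserve come from this page.

```python
def fits_on_gpu(weights_gb, kv_cache_gb, vram_gb=8.0,
                display_reserve_gb=0.6, overhead_gb=0.5):
    """Return (fits, headroom_gb) for a model held entirely in VRAM.

    overhead_gb is a hypothetical allowance for compute buffers and
    runtime context; the page does not state the value it uses.
    """
    usable_gb = vram_gb - display_reserve_gb
    needed_gb = weights_gb + kv_cache_gb + overhead_gb
    return needed_gb <= usable_gb, usable_gb - needed_gb


# Hypothetical sizes: 2.5 GB of weights plus 2.0 GB of KV cache.
fits, headroom = fits_on_gpu(weights_gb=2.5, kv_cache_gb=2.0)
print(fits, round(headroom, 2))
```

A negative headroom is the amount that would have to move to system RAM, which is where the offload tiers come in.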

Fit grid by context length

Model                         8K      16K     32K     64K     128K
Llama 3.1 8B                  Q4_K_M  IQ4_XS  IQ4_XS  Q8_0    Q8_0
Qwen3.5 9B                    IQ4_XS  IQ4_XS  IQ4_XS  Q8_0    Q8_0
Qwen3.5 4B                    Q8_0    Q6_K    IQ4_XS  Q8_0    Q8_0
Gemma 3 4B                    Q8_0    Q8_0    Q8_0    Q8_0    Q5_K_M
Mistral Nemo 12B              IQ4_XS  IQ4_XS  IQ4_XS  Q8_0    Q8_0
Phi 4 14B                     Q4_K_M  Q4_K_M  -       -       -
Gemma 4 12B                   IQ4_XS  IQ4_XS  IQ4_XS  IQ4_XS  Q8_0
Qwen3 14B                     IQ4_XS  IQ4_XS  IQ4_XS  -       -
Gemma 3 12B                   IQ4_XS  IQ4_XS  IQ4_XS  IQ4_XS  Q8_0
DeepSeek R1 Distill Qwen 14B  Q4_K_M  Q4_K_M  Q8_0    Q8_0    Q8_0
GPT OSS 20B                   Q8_0    Q8_0    Q8_0    Q8_0    Q8_0
Rocinante 12B                 IQ4_XS  IQ4_XS  IQ4_XS  Q8_0    Q8_0
Qwen3.5 27B                   IQ4_XS  IQ4_XS  Q8_0    Q8_0    Q5_K_M
Qwen3.5 35B A3B               Q8_0    Q8_0    Q8_0    Q8_0    Q8_0
Qwen3 Coder 30B A3B           Q8_0    Q8_0    Q8_0    Q8_0    Q8_0
Gemma 3 27B                   IQ4_XS  IQ4_XS  IQ4_XS  Q8_0    Q8_0
Gemma 4 26B A4B               Q8_0    Q6_K    Q8_0    Q8_0    Q5_K_M
Mistral Small 3.2 24B         IQ4_XS  IQ4_XS  IQ4_XS  Q8_0    Q8_0
DeepSeek R1 Distill Qwen 32B  Q4_K_M  Q4_K_M  Q8_0    Q8_0    Q4_K_M
Cydonia 24B                   Q4_K_M  Q4_K_M  Q6_K    Q8_0    Q8_0
Llama 3.3 70B                 IQ4_XS  Q4_K_M  Q4_K_M  -       -
Llama 4 Scout                 IQ4_XS  Q3_K_M  -       -       -
GPT OSS 120B                  Q8_0    Q8_0    Q8_0    Q8_0    -
GLM 4.5 Air                   IQ4_XS  IQ4_XS  -       -       -
Qwen3.5 122B A10B             -       -       -       -       -
Qwen3.5 397B A17B             -       -       -       -       -
DeepSeek R1                   -       -       -       -       -
Mistral Large 3               -       -       -       -       -
Kimi K2.5                     -       -       -       -       -
Anubis 70B                    IQ4_XS  Q4_K_M  Q4_K_M  -       -

Fits on GPU · Expert offload · Partial offload · CPU only

Top pick per use case

Coding · 32K

Qwen3 Coder 30B A3B Q8_0

Expert offload · ≈ 14 tok/s

Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.

Roleplay & writing · 16K

Qwen3.5 35B A3B Q8_0

Expert offload · ≈ 13 tok/s

The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.

Summarization · 32K

Qwen3.5 35B A3B Q8_0

Expert offload · ≈ 13 tok/s

The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.

RAG & documents · 16K

Qwen3.5 35B A3B Q8_0

Expert offload · ≈ 13 tok/s

The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.

Vision / image input · 16K

Gemma 4 26B A4B Q6_K

Expert offload · ≈ 14 tok/s

Google's fast MoE with native audio in. Nearly all of its weight sits in routed experts, so expert offload runs it comfortably on 12 GB cards.

Almost fits

These models can't run well on 8 GB at 32K: Llama 3.1 8B, Qwen3.5 9B, Qwen3.5 4B, Mistral Nemo 12B, Phi 4 14B, Gemma 4 12B and 19 more.

What an upgrade unlocks

Stepping up to an Apple M2 Ultra (128GB) (96 GB) unlocks 20 more models on GPU or expert offload at 32K, including Llama 3.1 8B, Qwen3.5 9B, Qwen3.5 4B.

Frequently asked questions

What is the best local LLM for an NVIDIA GeForce RTX 3060 Ti in 2026?

Qwen3 Coder 30B A3B is our top overall pick on the NVIDIA GeForce RTX 3060 Ti. It is a purpose-built agentic coder with the best local fill-in-the-middle and tool-calling under 70B, though it is useless at small talk.

How many local LLMs fit in 8 GB of VRAM?

At Q4_K_M quantization and 32K context, 1 of our 30 tracked models fits entirely in the NVIDIA GeForce RTX 3060 Ti's 8 GB of VRAM, and 4 more MoE models run via expert offload with enough system RAM.

Can an NVIDIA GeForce RTX 3060 Ti run a 70B model like Llama 3.3?

Yes, but only in the "CPU only" tier: Llama 3.3 70B runs at Q4_K_M at around 1 token/sec.
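A figure near 1 token/sec is what a rough memory-bandwidth bound would predict for CPU-only decoding, since each generated token has to read every weight once. The sketch below is one plausible way to estimate it, not the method this page uses. The 42.5 GB model size, the 76.8 GB/s figure for dual-channel DDR5-4800, and the 60% efficiency factor are all assumptions chosen for illustration.

```python
def decode_tokens_per_sec(model_gb, bandwidth_gb_s, efficiency=1.0):
    """Upper-bound decode speed when every weight is read once per token."""
    return bandwidth_gb_s * efficiency / model_gb


# Assumed inputs, not taken from this page:
#   ~42.5 GB for Llama 3.3 70B at Q4_K_M,
#   76.8 GB/s theoretical for dual-channel DDR5-4800,
#   a hypothetical 60% achieved-bandwidth efficiency.
cpu_only = decode_tokens_per_sec(42.5, 76.8, efficiency=0.6)
print(round(cpu_only, 2))
```

Under these assumptions the estimate lands at roughly 1 token/sec. Faster system RAM or a smaller quantization would raise it proportionally.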

Can an NVIDIA GeForce RTX 3060 Ti run DeepSeek R1?

Not the full 671B model — its Q2_K weights alone exceed 200 GB. The R1-Distill-Qwen 14B/32B models are the practical local alternative on this card.
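The 200 GB claim can be checked with back-of-the-envelope arithmetic: weight size is roughly parameter count times average bits per weight. The sketch below assumes 2.625 bits per weight, which is the block cost of the Q2_K tensor type. Real GGUF files mix tensor types and usually come out larger, so treat the result as a floor and not an exact file size. The function name `weights_gb` is hypothetical.

```python
def weights_gb(n_params, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: 2.625 bits/weight for pure Q2_K tensors; real files land higher.
print(round(weights_gb(671e9, 2.625), 1))  # 220.2
```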

How much VRAM do I need for 32K context?

The KV cache is separate from the weights and grows linearly with context. For a typical 8-14B dense model at 32K and f16 KV, budget 2-4 GB extra on top of the weights; MLA models like DeepSeek R1 need far less, and quantized KV (q8_0) halves it.
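The linear growth is easy to see in a small calculation. The sketch below is a standard estimate for a grouped-query-attention model and not the calculator's own code. It uses Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) and reports sizes in GiB. The q8_0 figure assumes llama.cpp's block layout of 34 bytes per 32 values, which is why it is slightly more than half of f16.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens,
                 bytes_per_element=2.0):
    """KV cache size in GiB: one K and one V vector per layer per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * bytes_per_element / 2**30


# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dimension 128.
f16 = kv_cache_gib(32, 8, 128, 32 * 1024)           # 4.0 GiB at f16
q8 = kv_cache_gib(32, 8, 128, 32 * 1024, 34 / 32)   # 2.125 GiB at q8_0
print(f16, q8)
```

Halving the context to 16K halves both numbers, which is why the fit grid changes so much between columns.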

Related guides