Local LLM hardware guides by model
Every model we track, with exact GGUF sizes and the smallest GPU that runs it well at 32K context. Each guide has per-quant VRAM tables and a full GPU compatibility grid.
Llama 3.1 8B
8.03B · Q4_K_M 4.6 GB
Aging, but the most finetuned model in history. The safe default when tooling compatibility matters more than raw smarts.
Runs well on a NVIDIA GeForce RTX 2080 Ti →
Qwen3.5 9B
8.95B · Q4_K_M 5.3 GB
The new small default. Frontier-distilled, natively multimodal, and embarrassingly good for 6 GB of weights.
Runs well on a NVIDIA GeForce RTX 2080 Ti →
Qwen3.5 4B
4.21B · Q4_K_M 2.6 GB
Laptop-class. Strong retrieval and extraction for its size; don't ask it to architect your codebase.
Runs well on a NVIDIA GeForce RTX 3080 10GB →
Gemma 3 4B
3.88B · Q4_K_M 2.3 GB
Edge Gemma with real vision. Fine as a first model on integrated graphics; outclassed by Qwen at the same size.
Runs well on a NVIDIA GeForce RTX 5060 →
Mistral Nemo 12B
12.25B · Q4_K_M 7.0 GB
The base under half the roleplay finetunes ever made. Natural prose, light guardrails, happiest at 16K.
Runs well on a NVIDIA GeForce RTX 5080 →
Phi 4 14B
14.66B · Q4_K_M 8.3 GB
Dense reasoning-per-gigabyte champion, dry as toast. The 16K context ceiling is the real constraint.
Cloud-rental class →
Gemma 4 12B
11.91B · Q4_K_M 6.6 GB
June 2026's 16GB-class headline: dense 12B with native vision and audio in one backbone. The new laptop ceiling.
Runs well on a NVIDIA GeForce RTX 2080 Ti →
Qwen3 14B
14.77B · Q4_K_M 8.4 GB
Held over from the 3.0 line because nothing in 3.5 lands between 9B and 27B. Still the solid dense mid-card pick.
Runs well on a NVIDIA GeForce RTX 5080 →
Gemma 3 12B
11.77B · Q4_K_M 6.8 GB
Balanced 12GB-card resident with dependable vision. Grounded, unpreachy prose is its quiet strength.
Runs well on a NVIDIA GeForce RTX 2080 Ti →
DeepSeek R1 Distill Qwen 14B
14.77B · Q4_K_M 8.4 GB
R1's reasoning habits in a 14B body. Thinking tokens buy real math and logic gains — and tax every casual reply.
Runs well on a AMD Radeon RX 7900 XT →
GPT OSS 20B
20.91B (4.19B active) · Q4_K_M 10.8 GB
OpenAI's small MoE: 4B active means it flies on modest cards. Heavily filtered — keep it on work tasks.
Runs well on a NVIDIA GeForce RTX 5060 →
Rocinante 12B
12.25B · Q4_K_M 7.0 GB
The budget roleplay king. Lowest slop-per-token of anything under 24 GB; the community keeps it alive for a reason.
Runs well on a NVIDIA GeForce RTX 5080 →
Qwen3.5 27B
26.9B · Q4_K_M 15.6 GB
The dense 24GB workhorse. If you want one model on a 3090 and no surprises, it's this.
Runs well on a NVIDIA GeForce RTX 4090 →
Qwen3.5 35B A3B
34.66B (3.45B active) · Q4_K_M 20.5 GB
The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.
Runs well on a NVIDIA GeForce RTX 5060 →
Qwen3 Coder 30B A3B
30.53B (3.35B active) · Q4_K_M 17.3 GB
Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.
Runs well on a NVIDIA GeForce RTX 5060 →
Gemma 3 27B
27.01B · Q4_K_M 15.4 GB
Sliding-window attention keeps long context cheap, and the vision stack still beats most of 2026's newcomers.
Runs well on a AMD Radeon RX 7900 XT →
Gemma 4 26B A4B
25.23B (3.8B active) · Q4_K_M 15.9 GB
Google's fast MoE with native audio in. Nearly all of its weight sits in routed experts, so expert offload runs it comfortably on 12 GB cards.
Runs well on a NVIDIA GeForce RTX 5070 →
Mistral Small 3.2 24B
23.57B · Q4_K_M 13.3 GB
No weaknesses, no crown. Vision, solid code, natural prose — the all-rounder when you refuse to pick a lane.
Runs well on a AMD Radeon RX 7900 XT →
DeepSeek R1 Distill Qwen 32B
32.76B · Q4_K_M 18.5 GB
The largest R1 distill that fits 24 GB. Strongest local math-and-logic per dollar of hardware.
Runs well on a NVIDIA GeForce RTX 5090 →
Cydonia 24B
23.57B · Q4_K_M 13.3 GB
Rocinante's bigger sibling on a Mistral Small base. The default serious-RP pick for 24 GB cards.
Runs well on a NVIDIA GeForce RTX 4090 →
Llama 3.3 70B
70.55B · Q4_K_M 39.6 GB
Still the dense 70B reference. Prose depth the MoEs haven't matched; the hardware bill is the price of admission.
Runs well on a Apple M2 Max (64GB) →
Llama 4 Scout
107.77B (17.17B active) · Q4_K_M 60.9 GB
17B active from 109B total, huge context on paper. Divisive reception; the long-document niche is where it earns keep.
Runs well on a NVIDIA GeForce RTX 5080 →
GPT OSS 120B
116.83B (5.71B active) · Q4_K_M 58.5 GB
The big filtered brain. Expert offload on a 24 GB card plus 64 GB RAM makes it a realistic home deployment.
Runs well on a NVIDIA GeForce RTX 5060 →
GLM 4.5 Air
110.47B (14.96B active) · Q4_K_M 68.0 GB
The community's favorite big MoE: agentic coding chops and genuinely good prose in one 12B-active package.
Runs well on a NVIDIA GeForce RTX 5070 →
Qwen3.5 122B A10B
122.11B (9.77B active) · Q4_K_M 71.3 GB
Arguably the best model a 128 GB Mac can run today. The unified-memory sweet spot the 397B can't reach.
Runs well on a Apple M2 Ultra (128GB) →
Qwen3.5 397B A17B
396.35B (17.35B active) · Q4_K_M 227.3 GB
The 3.5 flagship. Frontier-adjacent everything; for nearly everyone, this is a rental, not a purchase.
Cloud-rental class →
DeepSeek R1
671.03B (37.55B active) · Q4_K_M 376.7 GB
The open reasoning heavyweight, with famously vivid prose as a side effect. MLA keeps its KV cache absurdly small.
Cloud-rental class →
Mistral Large 3
673.42B (39.95B active) · Q4_K_M 379.0 GB
Mistral's 675B MLA+MoE flagship. European frontier weight class — and a cloud referral for all but server rigs.
Cloud-rental class →
Kimi K2.5
1026.41B (32.86B active) · Q4_K_M 578.6 GB
A trillion parameters, 33B active. The strongest open agentic coder going — and the definition of rent-don't-buy.
Cloud-rental class →
Anubis 70B
70.55B · Q4_K_M 39.6 GB
The 70B-class roleplay tune with current GGUFs. Llama 3.3 prose depth, none of the corporate manners.
Runs well on a Apple M2 Max (64GB) →