Local LLM hardware guides by model

Every model we track, with exact GGUF sizes and the smallest GPU that runs it well at 32K context. Each guide has per-quant VRAM tables and a full GPU compatibility grid.

Llama 3.1 8B

8.03B · Q4_K_M 4.6 GB

Aging, but the most finetuned model in history. The safe default when tooling compatibility matters more than raw smarts.

Runs well on a NVIDIA GeForce RTX 2080 Ti →

Qwen3.5 9B

8.95B · Q4_K_M 5.3 GB

The new small default. Frontier-distilled, natively multimodal, and embarrassingly good for 6 GB of weights.

Runs well on a NVIDIA GeForce RTX 2080 Ti →

Qwen3.5 4B

4.21B · Q4_K_M 2.6 GB

Laptop-class. Strong retrieval and extraction for its size; don't ask it to architect your codebase.

Runs well on a NVIDIA GeForce RTX 3080 10GB →

Gemma 3 4B

3.88B · Q4_K_M 2.3 GB

Edge Gemma with real vision. Fine as a first model on integrated graphics; outclassed by Qwen at the same size.

Runs well on a NVIDIA GeForce RTX 5060 →

Mistral Nemo 12B

12.25B · Q4_K_M 7.0 GB

The base under half the roleplay finetunes ever made. Natural prose, light guardrails, happiest at 16K.

Runs well on a NVIDIA GeForce RTX 5080 →

Phi 4 14B

14.66B · Q4_K_M 8.3 GB

Dense reasoning-per-gigabyte champion, dry as toast. The 16K context ceiling is the real constraint.

Cloud-rental class →

Gemma 4 12B

11.91B · Q4_K_M 6.6 GB

June 2026's 16GB-class headline: dense 12B with native vision and audio in one backbone. The new laptop ceiling.

Runs well on a NVIDIA GeForce RTX 2080 Ti →

Qwen3 14B

14.77B · Q4_K_M 8.4 GB

Held over from the 3.0 line because nothing in 3.5 lands between 9B and 27B. Still the solid dense mid-card pick.

Runs well on a NVIDIA GeForce RTX 5080 →

Gemma 3 12B

11.77B · Q4_K_M 6.8 GB

Balanced 12GB-card resident with dependable vision. Grounded, unpreachy prose is its quiet strength.

Runs well on a NVIDIA GeForce RTX 2080 Ti →

DeepSeek R1 Distill Qwen 14B

14.77B · Q4_K_M 8.4 GB

R1's reasoning habits in a 14B body. Thinking tokens buy real math and logic gains — and tax every casual reply.

Runs well on a AMD Radeon RX 7900 XT →

GPT OSS 20B

20.91B (4.19B active) · Q4_K_M 10.8 GB

OpenAI's small MoE: 4B active means it flies on modest cards. Heavily filtered — keep it on work tasks.

Runs well on a NVIDIA GeForce RTX 5060 →

Rocinante 12B

12.25B · Q4_K_M 7.0 GB

The budget roleplay king. Lowest slop-per-token of anything under 24 GB; the community keeps it alive for a reason.

Runs well on a NVIDIA GeForce RTX 5080 →

Qwen3.5 27B

26.9B · Q4_K_M 15.6 GB

The dense 24GB workhorse. If you want one model on a 3090 and no surprises, it's this.

Runs well on a NVIDIA GeForce RTX 4090 →

Qwen3.5 35B A3B

34.66B (3.45B active) · Q4_K_M 20.5 GB

The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.

Runs well on a NVIDIA GeForce RTX 5060 →

Qwen3 Coder 30B A3B

30.53B (3.35B active) · Q4_K_M 17.3 GB

Purpose-built agentic coder. Best local fill-in-the-middle and tool-calling under 70B; useless at small talk.

Runs well on a NVIDIA GeForce RTX 5060 →

Gemma 3 27B

27.01B · Q4_K_M 15.4 GB

Sliding-window attention keeps long context cheap, and the vision stack still beats most of 2026's newcomers.

Runs well on a AMD Radeon RX 7900 XT →

Gemma 4 26B A4B

25.23B (3.8B active) · Q4_K_M 15.9 GB

Google's fast MoE with native audio in. Nearly all of its weight sits in routed experts, so expert offload runs it comfortably on 12 GB cards.

Runs well on a NVIDIA GeForce RTX 5070 →

Mistral Small 3.2 24B

23.57B · Q4_K_M 13.3 GB

No weaknesses, no crown. Vision, solid code, natural prose — the all-rounder when you refuse to pick a lane.

Runs well on a AMD Radeon RX 7900 XT →

DeepSeek R1 Distill Qwen 32B

32.76B · Q4_K_M 18.5 GB

The largest R1 distill that fits 24 GB. Strongest local math-and-logic per dollar of hardware.

Runs well on a NVIDIA GeForce RTX 5090 →

Cydonia 24B

23.57B · Q4_K_M 13.3 GB

Rocinante's bigger sibling on a Mistral Small base. The default serious-RP pick for 24 GB cards.

Runs well on a NVIDIA GeForce RTX 4090 →

Llama 3.3 70B

70.55B · Q4_K_M 39.6 GB

Still the dense 70B reference. Prose depth the MoEs haven't matched; the hardware bill is the price of admission.

Runs well on a Apple M2 Max (64GB) →

Llama 4 Scout

107.77B (17.17B active) · Q4_K_M 60.9 GB

17B active from 109B total, huge context on paper. Divisive reception; the long-document niche is where it earns keep.

Runs well on a NVIDIA GeForce RTX 5080 →

GPT OSS 120B

116.83B (5.71B active) · Q4_K_M 58.5 GB

The big filtered brain. Expert offload on a 24 GB card plus 64 GB RAM makes it a realistic home deployment.

Runs well on a NVIDIA GeForce RTX 5060 →

GLM 4.5 Air

110.47B (14.96B active) · Q4_K_M 68.0 GB

The community's favorite big MoE: agentic coding chops and genuinely good prose in one 12B-active package.

Runs well on a NVIDIA GeForce RTX 5070 →

Qwen3.5 122B A10B

122.11B (9.77B active) · Q4_K_M 71.3 GB

Arguably the best model a 128 GB Mac can run today. The unified-memory sweet spot the 397B can't reach.

Runs well on a Apple M2 Ultra (128GB) →

Qwen3.5 397B A17B

396.35B (17.35B active) · Q4_K_M 227.3 GB

The 3.5 flagship. Frontier-adjacent everything; for nearly everyone, this is a rental, not a purchase.

Cloud-rental class →

DeepSeek R1

671.03B (37.55B active) · Q4_K_M 376.7 GB

The open reasoning heavyweight, with famously vivid prose as a side effect. MLA keeps its KV cache absurdly small.

Cloud-rental class →

Mistral Large 3

673.42B (39.95B active) · Q4_K_M 379.0 GB

Mistral's 675B MLA+MoE flagship. European frontier weight class — and a cloud referral for all but server rigs.

Cloud-rental class →

Kimi K2.5

1026.41B (32.86B active) · Q4_K_M 578.6 GB

A trillion parameters, 33B active. The strongest open agentic coder going — and the definition of rent-don't-buy.

Cloud-rental class →

Anubis 70B

70.55B · Q4_K_M 39.6 GB

The 70B-class roleplay tune with current GGUFs. Llama 3.3 prose depth, none of the corporate manners.

Runs well on a Apple M2 Max (64GB) →