Best local LLM for roleplay & writing on an NVIDIA GeForce RTX 3080 10GB (2026)
All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.
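To see why the f16 KV cache assumption matters, here is a rough sketch of the usual cache-size estimate for a grouped-query-attention model: 2 (K and V) × layers × KV heads × head dimension × context length × bytes per element. The architecture values below are hypothetical placeholders chosen for illustration, not GLM 4.5 Air's actual configuration, so treat the output as an order-of-magnitude guide only.

```shell
# Hypothetical architecture values for illustration; NOT GLM 4.5 Air's real config.
layers=48
kv_heads=8
head_dim=128
bytes_per_elem=2   # f16 stores each element in 2 bytes

kv_cache_mib() {
  # $1 = context length in tokens; the leading 2 covers the K and V tensors
  echo $(( 2 * layers * kv_heads * head_dim * $1 * bytes_per_elem / 1024 / 1024 ))
}

kv_16k=$(kv_cache_mib 16384)
kv_8k=$(kv_cache_mib 8192)
echo "f16 KV cache: ${kv_16k} MiB at 16K, ${kv_8k} MiB at 8K"
```

Under these placeholder values, halving the context halves the cache, which is why context length is one of the first knobs to try when a model barely fits.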
The verdict
GLM 4.5 Air IQ4_XS
Expert offload at 16K · Roleplay & writing score 7/10 · ≈ 6 tok/s · needs 56 GB system RAM
The community's favorite big MoE: agentic coding chops and genuinely good prose in one 12B-active package.
llama-server -m GLM-4.5-Air-IQ4_XS.gguf -c 16384 --flash-attn -ngl 99 --n-cpu-moe 46
Worthy alternates
Qwen3.5 35B A3B Q8_0
Expert offload · ≈ 13 tok/s · Roleplay & writing 6/10
The meta pick, full stop. Near-dense-30B quality at 3B-active speed, and expert offload puts it on 8 GB cards.
Gemma 4 26B A4B Q8_0
Expert offload · ≈ 12 tok/s · Roleplay & writing 6/10
Google's fast MoE with native audio input. Nearly all of its weight sits in routed experts, so expert offload runs it comfortably on 12 GB cards.
Frequently asked questions
What is the best local LLM for roleplay & writing on an NVIDIA GeForce RTX 3080 10GB?
GLM 4.5 Air at IQ4_XS — it scores 7/10 for roleplay & writing and runs as "Expert offload" at 16K context on the NVIDIA GeForce RTX 3080 10GB.
How much context do I need for roleplay & writing?
We recommend 16K tokens for roleplay & writing (minimum 8K). These picks are computed at 16K.
How fast will it run on an NVIDIA GeForce RTX 3080 10GB?
Roughly 6 tokens/sec for GLM 4.5 Air, which is usable for interactive chat.
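As a rough feel for what that speed means in practice, the sketch below converts it into wait time for a single reply. The reply length is a hypothetical value, and the estimate ignores prompt processing time, so real waits will be somewhat longer.

```shell
tok_per_sec=6        # the page's estimate for GLM 4.5 Air on this card
reply_tokens=300     # hypothetical reply length, roughly a few paragraphs

reply_seconds=$(( reply_tokens / tok_per_sec ))
echo "A ${reply_tokens}-token reply takes about ${reply_seconds} s at ${tok_per_sec} tok/s"
```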
Do I need more than 10 GB of VRAM for roleplay & writing?
No — the pick above needs 9.1 GB of VRAM plus 56 GB of system RAM at 16K.
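The fit can be sanity-checked with simple arithmetic. This sketch assumes the 0.6 GB display reserve comes on top of the quoted 9.1 GB (the page does not say whether it is already included) and uses the 64 GB system RAM baseline stated at the top.

```shell
# All sizes are in tenths of a GB so plain integer arithmetic works.
vram_total=100     # 10.0 GB card
vram_model=91      # 9.1 GB quoted for the pick at 16K
vram_display=6     # 0.6 GB display reserve (assumed to be on top of the 9.1 GB)
ram_total=640      # 64 GB system RAM assumed by the page
ram_model=560      # 56 GB quoted for the offloaded experts

vram_headroom=$(( vram_total - vram_model - vram_display ))
ram_headroom=$(( ram_total - ram_model ))

if [ "$vram_headroom" -ge 0 ] && [ "$ram_headroom" -ge 0 ]; then
  fits=yes
else
  fits=no
fi
echo "fits=$fits vram_headroom=$vram_headroom ram_headroom=$ram_headroom (tenths of a GB)"
```

With these inputs the VRAM margin is small, so other GPU-resident applications could push the setup over the limit.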
What settings should I use?
Start with our command: llama-server -m GLM-4.5-Air-IQ4_XS.gguf -c 16384 --flash-attn -ngl 99 --n-cpu-moe 46 — then tune context and KV quant in the fit calculator.
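One way to experiment with KV quant is sketched below. It builds a variant of the command above using llama.cpp's `--cache-type-k` and `--cache-type-v` options and prints it for review rather than running it. The `q8_0` choice is an assumption for illustration, not this page's recommendation, and the figures quoted here (computed with an f16 cache) would no longer apply exactly.

```shell
ctx=16384          # context length from the page; 8192 is the stated minimum
kv_type=q8_0       # assumption: illustrative KV cache type; the page's figures use f16

# Same base command as above, with the two cache-type options appended.
cmd="llama-server -m GLM-4.5-Air-IQ4_XS.gguf -c $ctx --flash-attn -ngl 99 --n-cpu-moe 46"
cmd="$cmd --cache-type-k $kv_type --cache-type-v $kv_type"

# Print the command for review; run it yourself once it looks right.
echo "$cmd"
```

Change `ctx` or `kv_type` one at a time and compare memory use and output quality against the f16 baseline before settling on a configuration.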