Q4_K_M vs Q8_0 for Local LLMs: Which Quantization Level Actually Wins
Updated June 23, 2026
Running large language models locally means fitting billions of parameters into whatever VRAM you actually have. Quantization is the lever that makes that possible: you trade precision for memory savings. The two levels developers reach for most often are Q4_K_M (4-bit, k-quant medium) and Q8_0 (8-bit, round-to-nearest). Q4_K_M needs roughly half the VRAM of Q8_0 for the same model. The question is whether the quality loss matters for your workload.
This comparison covers what each format does, how much memory it costs, where each one breaks down, and which you should default to in 2026.
| Feature | Q4_K_M | Q8_0 |
|---|---|---|
| Bits per weight | ~4.8 effective | ~8.5 effective (8-bit values plus a per-block scale) |
| VRAM for Llama 3.3 8B | ~5 GB | ~9 GB |
| VRAM for Qwen 3 32B | ~20 GB | ~36 GB |
| Quality loss vs FP16 | Measurable on benchmarks, rarely noticed in chat | Negligible |
| Tokens/sec (same GPU) | Faster (fewer weight bytes read per token) | Slower |
| Best for | Consumer GPUs, 8-24 GB VRAM | 48+ GB VRAM or quality-critical tasks |
| File format | GGUF | GGUF |
What quantization actually does to your weights
Half-precision (FP16) models store each parameter as a 16-bit float. A 7B-parameter model at FP16 needs roughly 14 GB of VRAM just for weights, before you account for the KV cache and runtime overhead. Most consumer GPUs top out at 8, 12, 16, or 24 GB. Without quantization, even a 7B model fits only on 16 GB and 24 GB cards, and anything much larger is out of reach on a typical gaming card.
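The 14 GB figure above is simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. A minimal sketch follows; the function name is hypothetical, and the estimate covers weights only, not the KV cache or runtime overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# 7B parameters at 16 bits per weight: the FP16 case from the text.
fp16_7b = weight_memory_gb(7e9, 16)

# The same model at roughly 4.8 bits per weight (the Q4_K_M figure used in this article).
q4_7b = weight_memory_gb(7e9, 4.8)

print(f"7B at FP16: {fp16_7b:.1f} GB, 7B at ~4.8 bpw: {q4_7b:.1f} GB")
```

Treat the result as a floor: real usage is higher once the context window and runtime buffers are allocated.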
Quantization maps those 16-bit floats down to lower bit-widths. The GGUF format (used by llama.cpp and every tool built on it) packages these quantized weights into a single file you can run on CPU, GPU, or a mix of both.
Q4_K_M and Q8_0 are not the only levels available. You will also see Q4_0, Q4_K_S, Q5_K_M, Q6_K, and others. But Q4_K_M and Q8_0 sit at the two practical extremes most developers actually choose between: the smallest level that still produces good output, and the largest level before you might as well run FP16.
How Q4_K_M works (and why the "K_M" matters)
Q4_0 was the original 4-bit quantization in llama.cpp: every weight gets mapped to one of 16 values using a single scale factor per block. It works, but it throws away information unevenly. Some layers in a transformer are more sensitive to precision loss than others.
Q4_K_M ("k-quant medium") improves on this by using different quantization parameters for different tensor types. Tensors that matter more for coherence, such as parts of the attention layers and the output head, get slightly higher precision. Less sensitive tensors get more aggressive compression. The result: Q4_K_M uses only slightly more memory than Q4_0 (about 4.8 bits per weight on average, versus about 4.5) but scores noticeably better on perplexity benchmarks.
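The "one scale factor per block" idea behind the original 4-bit scheme can be sketched in a few lines. This is a simplified illustration, not llama.cpp's exact Q4_0 layout or rounding: the function names are hypothetical, the block here has 4 values instead of a realistic block size, and the codes are kept as plain Python integers rather than packed nibbles.

```python
def quantize_block_4bit(block):
    """Map floats to 16 integer codes in [-8, 7] that share one scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 7 if amax > 0 else 1.0
    # Round to the nearest code and clamp to the 4-bit signed range.
    codes = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the shared scale and the codes."""
    return [scale * q for q in codes]


block = [0.7, -0.32, 0.1, 0.0]
scale, codes = quantize_block_4bit(block)
restored = dequantize_block(scale, codes)
errors = [abs(a - b) for a, b in zip(block, restored)]
print(codes, [round(v, 3) for v in restored])
```

Because the largest value in the block sets the scale, one outlier makes every other weight in that block coarser. That is the uneven information loss the k-quant variants try to reduce.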
If you are running Ollama, it typically gives you Q4_K_M when you `ollama pull` a model without specifying a quantization tag. That is not an accident. It is the sweet spot for the 8-24 GB VRAM range where most developers land.
How Q8_0 works
Q8_0 is straightforward round-to-nearest 8-bit quantization. Weights are grouped into small blocks, each block stores one scale factor, and each weight gets mapped to one of 256 values relative to that scale. The quality loss compared to FP16 is small enough that you need benchmarks to detect it; in blind chat comparisons, most people cannot tell. The cost is that your model file is close to twice the size of Q4_K_M for the same parameter count.
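A rough sketch of that round-trip, assuming blocks of 32 weights with one 16-bit scale each (which is how the ~8.5 effective bits per weight arises). The function name and the random test weights are made up for illustration, and the scale is kept as a Python float rather than a true 16-bit value.

```python
import random

BLOCK_SIZE = 32  # weights per block (assumed block size for this sketch)


def q8_roundtrip(block):
    """Quantize a block to signed 8-bit codes and reconstruct it."""
    amax = max(abs(x) for x in block)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [round(x / scale) for x in block]  # round-to-nearest, codes in [-127, 127]
    restored = [scale * q for q in codes]
    return restored, codes, scale


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
restored, codes, scale = q8_roundtrip(weights)

max_err = max(abs(a - b) for a, b in zip(weights, restored))
# 32 codes of 8 bits plus one 16-bit scale, spread over 32 weights.
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE
print(f"max error {max_err:.2e}, step {scale:.2e}, {bits_per_weight} bits/weight")
```

The worst-case error is half a quantization step, and with 255 usable steps per block that step is small relative to the weights themselves.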
For a practical reference, Big Data Boutique's local LLM hardware guide lists a Qwen 3 32B model at Q4 needing about 20 GB of VRAM. The same model at Q8 would need around 36 GB, pushing it off a single 24 GB card and into multi-GPU or CPU offloading territory.
Quality: where the gap shows up
The honest answer: for most conversational and coding tasks at 7B-14B scale, Q4_K_M output is good enough that you will not notice the difference. Llama 3.3 8B at Q4_K_M scores approximately 73.0 on MMLU, which sits in a range that required cloud API calls just two years ago.
Where Q4_K_M starts to hurt:
- Long structured output. JSON generation, code with deep nesting, or tasks requiring precise numerical reasoning accumulate small errors across many tokens. Q8_0 drifts less.
- Multilingual and low-resource languages. Quantization disproportionately affects tokens the model saw less during training. If you are working in a language other than English or Chinese, Q8_0 preserves more of the model's weaker capabilities.
- Very large models at the edge of your VRAM. A 70B model at Q4_K_M with aggressive GPU offloading can work, but you are already stacking compromises. If the model barely fits at Q8_0, the quality gain per token may justify the slower inference from partial CPU offload.
For general chat, summarization, and code completion at 7B-14B, Q4_K_M is the practical default.
Speed: smaller is faster
Inference speed in llama.cpp (and tools built on it) is largely memory-bandwidth bound. Smaller quantizations move fewer bytes per token, so they run faster on the same hardware. Q4_K_M will typically generate 30-50% more tokens per second than Q8_0 on the same GPU, assuming the model fits in VRAM in both cases.
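The bandwidth argument gives a quick back-of-the-envelope ceiling: if every generated token has to read all the weights once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below uses a hypothetical 1000 GB/s GPU and the ~5 GB and ~9 GB figures from the table as rough model sizes; none of these are measurements.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/sec if each token reads every weight once."""
    return bandwidth_gb_s / model_gb


BANDWIDTH_GB_S = 1000.0  # hypothetical GPU memory bandwidth, for illustration only

q4_ceiling = decode_ceiling_tok_per_s(BANDWIDTH_GB_S, 5.0)  # 8B model at Q4_K_M
q8_ceiling = decode_ceiling_tok_per_s(BANDWIDTH_GB_S, 9.0)  # same model at Q8_0

print(f"Q4_K_M ceiling: {q4_ceiling:.0f} tok/s")
print(f"Q8_0 ceiling:   {q8_ceiling:.0f} tok/s")
print(f"ratio: {q4_ceiling / q8_ceiling:.2f}x")
```

The ideal ratio here is larger than the 30-50% gap quoted above. A plausible reason is that dequantization compute, KV cache reads, and fixed per-token overhead do not shrink with the weights, so treat the ceiling as a bound rather than a prediction.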
One user on r/LocalLLaMA reported running Qwen 3.5-122B at Q4 fully in GPU RAM at around 50 tokens per second, a model that simply would not fit at Q8 without extreme multi-GPU setups.
This matters for agentic workflows and multi-turn conversations where latency compounds. If your pipeline calls the model dozens of times per task, the speed gap between Q4_K_M and Q8_0 adds up fast.
When to use each one
Default to Q4_K_M when:
- Your GPU has 8-24 GB of VRAM
- You want to run the largest model that fits (e.g., 14B-32B range on 16-24 GB)
- Speed matters more than marginal quality gains
- You are doing chat, summarization, or standard code completion
Pick Q8_0 when:
- You have 48+ GB of VRAM (A6000, dual consumer cards, or an M-series Mac with unified memory)
- Your task involves precise structured output (JSON schemas, data extraction)
- You are evaluating model quality and need a near-FP16 baseline
- The model is small enough that Q8_0 still fits comfortably (e.g., a 7B model on a 16 GB card)
There is a middle ground worth mentioning: Q5_K_M and Q6_K sit between the two and can be a good compromise if you have a few extra gigabytes of VRAM but not enough for Q8_0. Tools like LM Studio let you pick the quantization level at download time, making it easy to experiment.
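To see where the middle levels land, a small helper can pick the highest-precision level whose weights fit a given VRAM budget. This is a sketch under stated assumptions: the 4.8 figure for Q4_K_M comes from this article, the other bits-per-weight values are rough approximations, and the 2 GB reserve for context and runtime overhead is an arbitrary placeholder.

```python
# Approximate bits per weight. Q4_K_M is the article's figure; the rest are rough assumptions.
APPROX_BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}


def largest_level_that_fits(n_params, vram_gb, reserve_gb=2.0):
    """Return the highest-precision level whose weights fit, or None if none do."""
    budget_gb = vram_gb - reserve_gb
    fitting = [
        (bpw, name)
        for name, bpw in APPROX_BPW.items()
        if n_params * bpw / 8 / 1e9 <= budget_gb
    ]
    return max(fitting)[1] if fitting else None


choice_14b_16gb = largest_level_that_fits(14e9, 16.0)
choice_32b_24gb = largest_level_that_fits(32e9, 24.0)
choice_70b_24gb = largest_level_that_fits(70e9, 24.0)
print(choice_14b_16gb, choice_32b_24gb, choice_70b_24gb)
```

Under these assumptions a 14B model on a 16 GB card lands on Q6_K, which is the kind of in-between case the middle levels exist for.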
Q4_K_M
Pros
- Fits large models on consumer GPUs (8-24 GB VRAM)
- 30-50% faster inference than Q8_0 on the same hardware
- K-quant preserves quality in sensitive layers
- Default in Ollama for good reason
Cons
- Measurable quality loss on structured output and multilingual tasks
- Not ideal for benchmark evaluation or quality-sensitive pipelines
- Accumulated drift in very long outputs
Q8_0
Pros
- Near-FP16 quality, negligible perplexity loss
- Better for structured generation and precision tasks
- Good baseline for model evaluation
Cons
- Roughly 2x the VRAM of Q4_K_M
- Slower inference (more bytes per token)
- Pushes many models off single consumer GPUs
The practical decision tree
Ask yourself one question: does the model I want fit at Q8_0 in my VRAM, with room left for a reasonable context window? If yes, use Q8_0. If no, use Q4_K_M. That is the entire decision for most developers.
The only exception is speed-sensitive pipelines (agents, batch processing, tool-calling loops) where Q4_K_M's throughput advantage matters even if Q8_0 technically fits. In that case, benchmark both and check whether the quality difference affects your downstream task. Often it does not.
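The rule and its one exception fit in a short function. The name, parameters, and return strings are hypothetical, and the context reserve is something you would estimate for your own model and context length.

```python
def pick_quant(vram_gb, q8_weights_gb, context_reserve_gb, speed_sensitive=False):
    """Apply the article's rule: Q8_0 if it fits with room for context, else Q4_K_M."""
    q8_fits = q8_weights_gb + context_reserve_gb <= vram_gb
    if not q8_fits:
        return "Q4_K_M"
    if speed_sensitive:
        # Q8_0 fits, but throughput matters: measure before committing.
        return "benchmark Q4_K_M and Q8_0"
    return "Q8_0"


small_model = pick_quant(vram_gb=24, q8_weights_gb=9, context_reserve_gb=4)
large_model = pick_quant(vram_gb=24, q8_weights_gb=36, context_reserve_gb=4)
agent_loop = pick_quant(vram_gb=24, q8_weights_gb=9, context_reserve_gb=4, speed_sensitive=True)
print(small_model, large_model, agent_loop)
```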