Best local LLM for vision / image input on a NVIDIA GeForce RTX 4080 SUPER (2026)
All figures assume an f16 KV cache, a 0.6 GB display reserve on the GPU, and 64 GB of DDR5 system RAM for the offload tiers. Tune these in the calculator.
The verdict
Gemma 4 26B A4B Q8_0
Expert offload at 16K · Vision / image input score 8/10 · ≈ 12 tok/s · needs 27 GB system RAM
Google's fast MoE with native audio in. Nearly all of its weight sits in routed experts, so expert offload runs it comfortably on 12 GB cards.
llama-server -m google_gemma-4-26B-A4B-it-Q8_0.gguf -c 16384 --flash-attn -ngl 99 --n-cpu-moe 30
Worthy alternates
Gemma 4 12B Q8_0
Fits on GPU · ≈ 38 tok/s · Vision / image input 7/10
June 2026's 16GB-class headline: dense 12B with native vision and audio in one backbone. The new laptop ceiling.
Qwen3.5 9B Q8_0
Fits on GPU · ≈ 50 tok/s · Vision / image input 6/10
The new small default. Frontier-distilled, natively multimodal, and embarrassingly good for 6 GB of weights.
Tune this for your exact RAM and settings in the calculator → · All models on the NVIDIA GeForce RTX 4080 SUPER
Frequently asked questions
What is the best local LLM for vision / image input on a NVIDIA GeForce RTX 4080 SUPER?
Gemma 4 26B A4B at Q8_0 — it scores 8/10 for vision / image input and runs as "Expert offload" at 16K context on the NVIDIA GeForce RTX 4080 SUPER.
How much context do I need for vision / image input?
We recommend 16K tokens for vision / image input (minimum 8K). These picks are computed at 16K.
How fast will it run on a NVIDIA GeForce RTX 4080 SUPER?
Roughly 12 tokens/sec for Gemma 4 26B A4B — usable for interactive use.
Do I need more than 16 GB of VRAM for vision / image input?
No — the pick above needs 8.1 GB of VRAM plus 27 GB of system RAM at 16K.
What settings should I use?
Start with our command: llama-server -m google_gemma-4-26B-A4B-it-Q8_0.gguf -c 16384 --flash-attn -ngl 99 --n-cpu-moe 30 — then tune context and KV quant in the fit calculator.