Self-Hosting vs API: How Much Does Running an LLM Actually Cost in 2026?
Updated June 23, 2026
The question "how much is an LLM" has at least three honest answers depending on what you mean: training one from scratch, renting one through an API, or running an open-weight model on your own hardware. The spread is enormous. Training GPT-4 cost north of $78 million in compute alone, according to the Stanford AI Index Report 2025. Calling Claude 4 Sonnet through Anthropic's API costs a fraction of a cent per request. Running Llama 3 on a used workstation costs the price of electricity.
This post compares the two paths most developers actually choose: self-hosted open-weight models versus managed API providers. We will ignore the "train a frontier model" path because if you have $100M in GPU budget, you are not reading this article.
| Feature | Self-Hosted LLM | API LLM |
|---|---|---|
| Upfront cost | $0 (CPU) to $2,000+ (GPU) | $0 |
| Per-request cost | Electricity only | $0.10–$60 per 1M input tokens |
| Latency control | Full (local hardware) | Depends on provider load |
| Model selection | Open-weight only (Llama, Mistral, Gemma, etc.) | Proprietary + open-weight |
| Data privacy | Nothing leaves your machine | Data sent to third-party servers |
| Scaling past 1 user | Buy more GPUs or queue | Scales automatically |
| Maintenance | You handle updates, quantization, infra | Provider handles everything |
API pricing: cheap per call, expensive at volume
Managed providers price by the million tokens, split between input and output. As of mid-2026, the range from LLM Price Check and Price Per Token looks roughly like this:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | ~$2.50 | ~$10.00 |
| Claude 4 Sonnet | ~$3.00 | ~$15.00 |
| Gemini 2.5 Flash | ~$0.15 | ~$0.60 |
| DeepSeek V4 (API) | ~$0.27 | ~$1.10 |
| Llama 3 70B (Groq) | ~$0.64 | ~$0.80 |
For light usage (a few hundred requests a day, short prompts), this is genuinely cheap. A solo developer prototyping an AI feature might spend $5 to $20 per month.
The math changes at scale. Red Hat's analysis of hidden LLM costs estimates that a small startup handling 10,000 requests per day with a capable model could spend $15,000 per month on API calls. A Dell Technologies white paper on inference costs pegged GPT-4o at roughly $12.19 per user per month for a 70B-class workload, which compares to the $20 to $30 per user that suite-based AI tools like Copilot or ChatGPT Team charge.
The takeaway: API pricing is linear. Double your traffic, double the bill. There is no volume discount that changes the curve dramatically.
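To make the linear scaling concrete, here is a minimal sketch of the monthly bill calculation. The prices are the approximate figures from the table above; the function names and the 30-day month are assumptions made for illustration, not any provider's billing API.

```python
# Approximate mid-2026 prices from the table above, in USD per 1M tokens:
# (input price, output price). Treat them as rough estimates.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "Claude 4 Sonnet": (3.00, 15.00),
    "Gemini 2.5 Flash": (0.15, 0.60),
    "DeepSeek V4 (API)": (0.27, 1.10),
    "Llama 3 70B (Groq)": (0.64, 0.80),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def monthly_cost(model: str, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """USD cost per month, assuming every request has the same token counts."""
    return request_cost(model, input_tokens, output_tokens) * requests_per_day * days


if __name__ == "__main__":
    # Hypothetical workload: 1,000 input and 500 output tokens per request.
    for model in PRICES:
        for volume in (1_000, 10_000):
            cost = monthly_cost(model, volume, 1_000, 500)
            print(f"{model:>20} @ {volume:>6} req/day: ${cost:,.2f}/month")
```

Because the bill is a straight product of per-request cost and volume, multiplying `requests_per_day` by ten multiplies the result by ten. Real workloads have variable prompt lengths, so average token counts from your own logs are a better input than the round numbers used here.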
Self-hosting: high floor, flat ceiling
Running a model locally means you pay for hardware once (or rent a GPU instance monthly), after which each additional token costs only electricity.
The minimum viable setup for local inference:
- CPU-only (free hardware you already own): Tools like Ollama or llama.cpp run quantized models on a laptop CPU. A 7B model at Q4 quantization needs about 4 GB of RAM. It works, but expect 5 to 15 tokens per second on a modern laptop. Adequate for personal use, not for serving users.
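- Consumer GPU ($300 to $1,500): An RTX 4060 (8 GB VRAM, ~$300 used) runs 7B models comfortably and 13B models at Q4 with little headroom. An RTX 3090 (24 GB VRAM, ~$700 used) handles models up to roughly 30B at Q4; a 70B model at Q4 needs around 40 GB for its weights, so it only runs on a 24 GB card with more aggressive quantization or partial CPU offload. Token throughput jumps to 30 to 80 tokens per second depending on model size and quantization level.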
- Prosumer GPU ($1,500 to $4,000): An RTX 4090 (24 GB VRAM) or used A6000 (48 GB VRAM) gives you headroom for 70B models at higher quantization or multiple concurrent users.
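A quick way to sanity-check which tier a model needs is to estimate its weight memory as parameters times bits per weight. The sketch below does that; the 1.2 overhead factor for KV cache and runtime buffers is an assumption chosen for illustration, and real Q4 formats often store somewhat more than 4 bits per weight, so treat the results as a lower bound.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_memory(params_billions: float, bits_per_weight: float,
                   memory_gb: float, overhead: float = 1.2) -> bool:
    """Rough check: weights plus an assumed 20% overhead must fit in memory."""
    return weight_memory_gb(params_billions, bits_per_weight) * overhead <= memory_gb


if __name__ == "__main__":
    # Model sizes and card capacities taken from the list above.
    for params in (7, 13, 30, 70):
        needed = weight_memory_gb(params, 4) * 1.2
        cards = [gb for gb in (8, 24, 48) if fits_in_memory(params, 4, gb)]
        print(f"{params}B at 4 bits: ~{needed:.1f} GB, fits on cards with {cards} GB")
```

Under these assumptions a 7B model at 4 bits lands near the 4 GB figure quoted above, while a 70B model at 4 bits comes out around 42 GB, which points to a 48 GB card, lower-bit quantization, or partial CPU offload.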
Electricity cost is real but modest. An RTX 4090 under inference load draws about 300W. At $0.15/kWh (US average), running inference 8 hours a day costs roughly $10.80 per month. Even 24/7 inference sits around $32 per month in electricity.
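The electricity figures above follow from a simple watts-to-kWh conversion. A minimal sketch, assuming a 30-day month and a constant power draw (real cards idle well below their load figure):

```python
def monthly_electricity_cost(watts: float, hours_per_day: float,
                             usd_per_kwh: float, days: int = 30) -> float:
    """USD per month for a device drawing `watts` for `hours_per_day`."""
    kwh_per_month = watts / 1000 * hours_per_day * days
    return kwh_per_month * usd_per_kwh


if __name__ == "__main__":
    # 300 W and $0.15/kWh are the figures quoted in the paragraph above.
    for hours in (8, 24):
        cost = monthly_electricity_cost(300, hours, 0.15)
        print(f"{hours} h/day: ${cost:.2f}/month")
```

Swap in your local rate and the measured draw of the whole machine, not just the GPU, for a more realistic number.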
Self-Hosted LLM
Pros
- Zero per-token cost after hardware purchase
- Complete data privacy
- No rate limits or downtime from provider outages
- Full control over quantization, context length, and model choice
Cons
- Upfront hardware investment ($300–$4,000 for useful inference)
- You maintain the stack: model updates, driver compatibility, quantization tuning
- Scaling to multiple concurrent users requires more GPUs or a serving framework like vLLM
- Open-weight models still trail frontier proprietary models on complex reasoning
The breakeven calculation
The crossover point depends on your request volume and model tier. Here is a rough sketch.
Suppose you use a model comparable to Llama 3 70B. On Groq's API, that costs about $0.64 per million input tokens and $0.80 per million output tokens. A typical request (1,000 input tokens, 500 output tokens) costs roughly $0.001.
A used RTX 3090 costs about $700. At 1,000 requests per day, API cost is roughly $30 per month, so the GPU pays for itself in about 23 months before electricity. At 10,000 requests per day, API cost is roughly $300 per month and the payback period drops to a little over 2 months.
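The payback arithmetic can be sketched as hardware cost divided by monthly savings. The function name and the optional electricity term are assumptions added for illustration; the $700 and $30/$300 inputs come from the paragraph above.

```python
from typing import Optional


def breakeven_months(hardware_usd: float, api_usd_per_month: float,
                     electricity_usd_per_month: float = 0.0) -> Optional[float]:
    """Months until the hardware is paid off by avoided API spend.

    Returns None when running costs meet or exceed the API bill,
    since the hardware then never pays for itself.
    """
    monthly_savings = api_usd_per_month - electricity_usd_per_month
    if monthly_savings <= 0:
        return None
    return hardware_usd / monthly_savings


if __name__ == "__main__":
    for api_bill in (30, 300):
        months = breakeven_months(700, api_bill)
        print(f"API bill ${api_bill}/month: breakeven in {months:.1f} months")
```

Passing a nonzero `electricity_usd_per_month` lengthens the payback period, and at low request volumes it can erase the savings entirely, which is why the function returns `None` rather than a negative number.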
For frontier-tier models (GPT-4o, Claude 4 Sonnet), the API costs are 5 to 15 times higher per token, but you cannot self-host those models. That is the real constraint: if your workload demands proprietary frontier reasoning, self-hosting is not an option regardless of cost.
When self-hosting wins
Self-hosting makes financial sense when three conditions align: you can tolerate open-weight model quality, you have enough daily volume to amortize hardware, and you value data privacy or latency control. Batch processing pipelines (embeddings, classification, summarization over large datasets) are the clearest win. You saturate the GPU, pay nothing per token, and the economics are overwhelming.
For production serving with concurrent users, a framework like vLLM adds batching and paged attention, pushing throughput significantly higher than naive single-request inference. If you are comparing local inference runners, our breakdown of Jan vs LM Studio covers the GUI-friendly side of the tooling landscape.
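Before committing to hardware, it helps to check whether a single GPU can keep up with your volume at all. The sketch below estimates a daily request ceiling from generation speed alone. It is a simplification under stated assumptions: requests are served one at a time, prompt processing time is ignored, and the function name is hypothetical.

```python
def requests_per_day_capacity(tokens_per_second: float,
                              output_tokens_per_request: int,
                              hours_per_day: float = 24.0) -> float:
    """Upper bound on requests per day for sequential, single-stream generation."""
    seconds_available = hours_per_day * 3600
    return tokens_per_second * seconds_available / output_tokens_per_request


if __name__ == "__main__":
    # 30 to 80 tokens/second is the consumer-GPU range quoted earlier;
    # 500 output tokens matches the "typical request" in the breakeven section.
    for speed in (30, 50, 80):
        capacity = requests_per_day_capacity(speed, 500)
        print(f"{speed} tok/s: up to {capacity:,.0f} requests/day")
```

At 50 tokens per second this ceiling is 8,640 requests per day, so a 10,000-request workload would need either a faster setup or batched serving, which is where a framework like vLLM earns its place.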
When APIs win
APIs win when you need frontier model quality, have unpredictable or bursty traffic, or want zero ops burden. A startup with three engineers shipping an AI feature should almost certainly start with an API. The $50 to $200 per month API bill during early product development is cheaper than the engineering time to set up and maintain an inference stack. Migrate to self-hosting later if the usage pattern stabilizes and the volume justifies hardware.
APIs also win when you need the latest model within days of release. Open-weight alternatives trail proprietary releases by months, and the proprietary models themselves never become self-hostable (you will never self-host GPT-5).
What about fine-tuning?
Fine-tuning a pre-trained model sits between "train from scratch" and "just use it." Costs range from a few hundred dollars (fine-tuning a 7B model on a rented A100 for a few hours) to tens of thousands (fine-tuning a 70B model on a large dataset). Galileo's training cost analysis estimates 60 to 90 percent cost savings compared to training from scratch. Most developers will use LoRA or QLoRA adapters, which reduce GPU memory requirements drastically and bring fine-tuning within reach of a single 24 GB consumer GPU.
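The memory savings from LoRA come from training two small low-rank matrices instead of the full weight matrix: for a weight of shape d by k and rank r, the adapter has r × (d + k) trainable parameters rather than d × k. A minimal sketch of that count, using a hypothetical 4096 by 4096 projection and rank 8 chosen purely for illustration:

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when updating a d x k weight matrix directly."""
    return d * k


def lora_params(d: int, k: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: B is d x rank, A is rank x k."""
    return rank * (d + k)


if __name__ == "__main__":
    d, k, rank = 4096, 4096, 8  # hypothetical layer shape and adapter rank
    full = full_params(d, k)
    lora = lora_params(d, k, rank)
    print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

Fewer trainable parameters means far smaller gradients and optimizer state, which accounts for much of the memory reduction. QLoRA goes further by keeping the frozen base weights in 4-bit form.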
The bottom line in real numbers
For a developer asking "how much is an LLM," here is the honest range:
- Casual personal use (local, CPU): $0.
- Serious local inference (one GPU): $300 to $1,500 once, plus ~$10 to $30 per month in electricity.
- API for a prototype: $5 to $50 per month.
- API for a production app (10k+ requests/day): $300 to $15,000+ per month.
- Fine-tuning an existing model: $100 to $50,000 depending on model size and dataset.
- Training a frontier model from scratch: $50 million to $100 million+.
Pick the row that matches your actual workload. The cheapest LLM is the one you run locally on hardware you already own. The most capable LLM is the one you rent from an API. The right answer is almost always to start with the API and self-host the workloads that prove expensive enough to justify the migration.