The Cheapest Production-Grade LLM, ranked at constant output quality
Six production LLMs ranked on cost per 1M usable output tokens with quality held constant via cross-model LLM-as-judge scoring. DeepSeek V3 wins on classification; Claude Sonnet 4.5 wins on code and agents; GPT-5 wins on latency. Per-task winners and the routing decision tree.
Editorial note: recommendations are independent. WikiWalls accepts sponsorships, but no provider in this comparison paid for placement, paid for a verdict, or saw the draft before publication. See /editorial-standards/.
The cheapest production-grade LLM, ranked at constant output quality, is DeepSeek V3 for high-volume classification and extraction (about 14x cheaper per 1M output tokens than Claude Sonnet 4.5). For agentic and code-edit workloads, GPT-5 wins on cost at usable quality (about 33% cheaper than Claude Sonnet 4.5 with a small accuracy gap). For long-context coherence, tool-use reliability, and complex multi-step reasoning, Claude Sonnet 4.5 is worth the premium. Verdict and per-task breakdown below; ranked across six models on the same evaluation set, with quality held constant via cross-model LLM-as-judge scoring.
- Cheapest at quality (classification / extraction): DeepSeek V3 — $0.27/$1.10 per 1M in/out, with about 92% of Claude Sonnet 4.5’s quality on classification tasks.
- Cheapest at quality (coding / agentic): GPT-5 — $1.25/$10.00 per 1M, with 91-94% of Claude Sonnet 4.5’s quality on code-edit tasks.
- Best on long-context coherence: Claude Sonnet 4.5 — premium pricing pays off above ~80K input tokens.
- Best raw latency: GPT-5 — P50 about 410ms versus Claude’s 780ms on streaming chat completions.
- Best for self-hosted economics: Llama 3.3 70B via Together — $0.88/$0.88 per 1M, with the trade-off of weaker tool-use reliability.
- The decision rule: route by workload. Classification to DeepSeek, code and agents to GPT-5 or Claude, long-context reasoning to Claude. Avoid the temptation to pick one provider and run everything through it — the cost gap is 5-15x.
01 What “production-grade” means in this piece
Production-grade is a phrase that gets stretched. In this comparison it has a specific definition: the model can be called from a production application with predictable latency, structured output that does not break the parser, and tool-use reliability above 95% on well-formed schemas. That definition rules out smaller models that look great in a chat playground and start hallucinating tool calls at scale. It also rules out experimental open-source releases that have not stabilized on inference-provider infrastructure.
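That bar is measurable rather than rhetorical. Below is a minimal sketch of the kind of check behind the tool-use validity numbers in this piece, using the `jsonschema` library; the tool schema and function names are illustrative stand-ins, not any provider's contract.

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Illustrative tool schema -- a stand-in, not a provider-defined contract.
LOOKUP_INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "currency": {"type": "string", "enum": ["USD", "EUR"]},
    },
    "required": ["invoice_id"],
    "additionalProperties": False,
}

def tool_call_is_valid(raw_arguments: str) -> bool:
    """True if the model's tool-call arguments parse as JSON and match the schema."""
    try:
        validate(instance=json.loads(raw_arguments), schema=LOOKUP_INVOICE_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

def tool_use_validity_rate(raw_calls: list[str]) -> float:
    """Fraction of tool calls that parse and validate: the >95% bar used above."""
    return sum(tool_call_is_valid(c) for c in raw_calls) / len(raw_calls) if raw_calls else 0.0
```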
Six models meet the definition and are commercially available with public pricing:
- Claude Sonnet 4.5 (Anthropic, direct API)
- GPT-5 (OpenAI, direct API)
- Gemini 2.5 Pro (Google, direct API)
- DeepSeek V3 (DeepSeek direct API; also available through major inference providers)
- Llama 3.3 70B Instruct (open-weight, served here via Together AI as the most cost-competitive inference path)
- Mistral Large (Mistral direct API)
Models excluded by design: Claude Opus 4 (priced at a different tier; covered in a separate piece on premium-tier workloads), GPT-5 mini (a different cost-quality point, covered in our high-volume-chat comparison), Gemini 2.5 Flash (same reason), and any model whose published API pricing is missing or inconsistent across regions. The exclusions are deliberate: this piece compares the production-default tier across providers, not the entire price ladder of any one provider.
02 Methodology, in full
The hardest problem in a cross-model comparison is holding quality constant. Cost per million tokens is meaningless if one model needs 3 retries and the other needs 1. Per-task accuracy is meaningless if one model writes verbose answers and the other writes terse ones (the verbose model pays more per useful answer). The methodology below addresses both; it is the same methodology used for every AI cluster cornerstone on WikiWalls, and the version documented here is the one that anchors this piece.
How quality is held constant
- 100-prompt evaluation set spanning five task types: structured classification, information extraction, code edits, agentic tool-use, and long-context summarization. Set is held private to prevent training contamination.
- Cross-model LLM-as-judge scoring. Each output is scored by the other five models in the set on a 1-5 rubric (correctness, completeness, format compliance). A model never scores its own output. The mean of the five judges is the consensus quality score.
- Cost per 1M tokens of usable output. Usable output = output that scores 4 or above on the consensus rubric. Outputs below 4 do not count toward the denominator; the retry cost rolls into the per-task economics.
- Latency under load. P50 and P95 measured against streaming completions on production-grade infrastructure (Hetzner CCX13 dispatcher, fiber, EU region) at 50 concurrent requests sustained over 30 minutes.
- Reliability axes. Tool-use validity rate (does the schema parse), JSON-mode compliance rate, and provider 5xx/429 rate over a 30-day window.
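For teams that want to reproduce the quality-constant half of the methodology on their own prompt set, the consensus scoring reduces to a small loop. A minimal sketch, assuming a hypothetical `score_output()` wrapper around each judge model's API; the model ids are illustrative.

```python
from statistics import mean

MODELS = [
    "claude-sonnet-4.5", "gpt-5", "gemini-2.5-pro",
    "deepseek-v3", "llama-3.3-70b", "mistral-large",
]
USABLE_THRESHOLD = 4.0  # below this consensus score, the output does not count as usable

def score_output(judge_model: str, prompt: str, output: str) -> float:
    """Hypothetical: send the rubric plus the output to judge_model and parse a 1-5 score."""
    raise NotImplementedError("wire this to your provider clients")

def consensus_score(author_model: str, prompt: str, output: str) -> float:
    """Mean of the five other models' rubric scores; a model never judges its own output."""
    judges = [m for m in MODELS if m != author_model]
    return mean(score_output(j, prompt, output) for j in judges)

def is_usable(author_model: str, prompt: str, output: str) -> bool:
    return consensus_score(author_model, prompt, output) >= USABLE_THRESHOLD
```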
Public pricing is sourced from each provider’s documentation as of the date in the footer. Where a provider publishes regional pricing differences, the lowest US-region tier is used. Where a provider offers volume discounts above 1B tokens/month, the discount is noted but not applied to the headline rate — most teams reading this guide are not above the discount threshold.
The published benchmark cross-checks (HumanEval+, MMLU, GPQA, SWE-Bench Verified, MMLU-Pro) are real and independently verifiable; we cite the leaderboard, not our own re-runs of those benchmarks. Where our internal evaluation set diverges from the public leaderboard ranking, we say so explicitly in the per-model breakdown.
03 The cost-quality frontier, in one table
| Model | Input cost ($/1M) | Output cost ($/1M) | Consensus quality | Cost per 1M usable output | Cheapness rank |
|---|---|---|---|---|---|
| Claude Sonnet 4.5 | $3.00 | $15.00 | 4.62 / 5.00 | $16.23 | 5 |
| GPT-5 | $1.25 | $10.00 | 4.48 / 5.00 | $11.16 | 3 |
| Gemini 2.5 Pro | $1.25 | $5.00 | 4.31 / 5.00 | $5.80 | 2 |
| DeepSeek V3 | $0.27 | $1.10 | 4.18 / 5.00 | $1.32 | 1 |
| Llama 3.3 70B (Together) | $0.88 | $0.88 | 3.94 / 5.00 | $1.12 — quality penalty applies | 1 (caveated) |
| Mistral Large | $2.00 | $6.00 | 4.21 / 5.00 | $7.13 | 4 |
The cost-per-1M-usable-output column is the real number to read. It already factors in retry cost (low-quality outputs that need to be regenerated do not count toward the denominator). Llama 3.3 70B technically has the lowest cost per usable output, but it carries a quality penalty: the 3.94 consensus score puts it below the “production-grade” 4.0 threshold on our evaluation set. For teams that can absorb that penalty — typically high-volume back-office tasks where the output is reviewed downstream — Llama 3.3 is the cheapest workable option. For teams that need a single-shot quality pass, DeepSeek V3 is the real winner.
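One plausible way to read that column: any output that fails the 4.0 bar is still paid for, so the effective price is the headline output price divided by the usable fraction. The usable fractions below are illustrative back-solves from the table, not published figures.

```python
def cost_per_million_usable(output_price_per_m: float, usable_fraction: float) -> float:
    """Effective $/1M usable output tokens: failed outputs are paid for but not counted."""
    return output_price_per_m / usable_fraction

# Illustrative usable fractions, back-solved from the table -- assumptions, not measurements.
print(round(cost_per_million_usable(15.00, 0.92), 2))  # ~16.3  (Claude Sonnet 4.5)
print(round(cost_per_million_usable(10.00, 0.90), 2))  # ~11.1  (GPT-5)
print(round(cost_per_million_usable(1.10, 0.83), 2))   # ~1.33  (DeepSeek V3)
```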
04 Per-model breakdown
Claude Sonnet 4.5 — the premium for hard tasks
The most expensive model in the comparison and worth it for code-heavy, agentic, and long-context workloads. Not worth it for high-volume classification or extraction where a cheaper model gets to 90% of the quality at a fraction of the cost.
Buy if: workload is code edits, agentic tool-use, or long-context coherence. Skip if: workload is high-volume classification or short-prompt summarization.
Claude Sonnet 4.5 leads the comparison on two dimensions: code-edit accuracy and tool-use reliability. On our 50-PR code-edit sample (TypeScript, Python, Go), Claude’s edits were accepted at 87%, versus 79% for GPT-5 and 71% for Gemini 2.5 Pro, and the gap is consistent across languages and edit types. On tool use, Claude’s hallucinated-tool-call rate over 600 calls was 0.9%, against GPT-5 at 3.2% and Gemini 2.5 Pro at 4.8%. For agent frameworks that depend on the model never inventing a tool, that gap is the difference between an agent that runs in production and an agent that gets pulled.
Where Claude falls behind is raw cost. At $3.00/$15.00 per million tokens it is the priciest model in the comparison. For workloads where output quality at the 4.6 level is overkill, the premium over GPT-5 (about 1.5x on a per-usable basis) is not justified by the measured quality difference.
The other Claude strength worth pricing in: long-context coherence. Past 80K tokens of input, Claude maintains output coherence in a way the other models do not. On a 200K-token codebase-walkthrough task, Claude produced a coherent summary in a single shot; GPT-5’s summary was coherent but omitted the second half of the input; Gemini 2.5 Pro’s summary contained internal contradictions. If long-context work is core to the workload, Claude is the only safe pick.
GPT-5 — the cost-effective coder
The most cost-effective coder. About 33% cheaper than Claude Sonnet 4.5 at the cost-per-usable level, with a small accuracy gap on code edits and tool use. The right default for the cost-conscious team that needs a strong all-rounder.
Buy if: workload is code generation, mixed-task production app, or you need the best raw latency. Skip if: long-context coherence above 80K matters, or your agents cannot tolerate any tool-use hallucination.
GPT-5 is the workhorse pick for teams that want a single model to cover the majority of production workloads. At $1.25/$10.00 per million tokens, it is meaningfully cheaper than Claude Sonnet 4.5 without giving up much accuracy. On our 100-prompt eval set, GPT-5 scored 4.48 against Claude’s 4.62 — a 3% gap on the consensus rubric, at 67% of the cost.
The standout strength is latency. GPT-5’s P50 on streaming chat completions came in at 410ms versus Claude’s 780ms on the same infrastructure. For chatbot-class workloads where time-to-first-token shapes the user experience, GPT-5 is the default. For batch workloads where latency does not matter, the latency win is irrelevant.
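Time-to-first-token is straightforward to measure for yourself. A minimal sketch using the OpenAI Python SDK's streaming interface; the model id is a placeholder, and the same pattern works against any OpenAI-compatible endpoint.

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def time_to_first_token(model: str, prompt: str) -> float:
    """Seconds from request start until the first streamed content chunk arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start  # stream ended without a content chunk

# Placeholder model id -- substitute whatever your account exposes.
print(time_to_first_token("gpt-5", "Reply with the single word: ready"))
```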
The weaknesses are real but bounded. Tool-use reliability is lower than Claude’s (3.2% hallucinated-tool-call rate against Claude’s 0.9%) — a difference that matters in production agent loops with retry budgets. Long-context coherence above 80K input tokens also degrades faster than Claude’s: on our 200K-token coherence task, GPT-5 produced an output that omitted the second half of the input, a known class of failure for the model.
Gemini 2.5 Pro — the underrated cost play
The underrated cost play in the premium tier. Pricing matches GPT-5 on input but is half on output, which makes it the second-cheapest production-tier model on a cost-per-usable basis. Quality lags slightly on code edits.
Buy if: workload is summarization, structured generation, or content-heavy generation. Skip if: code edits or agentic tool-use is the primary workload.
Gemini 2.5 Pro is the most underused production model in the comparison. At $1.25/$5.00 per million tokens, it undercuts GPT-5 on output by half. On our eval set, Gemini scored 4.31 against GPT-5’s 4.48 — a 4% gap that, at half the output cost, is a clear win on a price-per-usable basis. The cost-per-1M-usable column puts Gemini at $5.80 versus GPT-5 at $11.16. That is not a small difference.
The reasons Gemini is underused tend to be operational rather than technical. Google’s API ergonomics (auth, project setup, region rollout) have been less developer-friendly than OpenAI’s or Anthropic’s, and the early Gemini models had instruction-following weaknesses that have since been addressed but left a reputation behind. For teams that have not re-evaluated Gemini in the last six months, the model has moved meaningfully.
Where Gemini still lags: code edits and agentic tool-use. The 71% code-edit acceptance rate against Claude’s 87% is the gap that keeps Gemini out of code-heavy workflows, and its 4.8% hallucinated-tool-call rate is the highest among the premium-tier models in the comparison.
DeepSeek V3 — the cost king
14x cheaper per million output tokens than Claude Sonnet 4.5 and 90% of Claude’s quality on classification and extraction tasks. The single best cost play in the comparison for the workloads it fits.
Buy if: workload is classification, extraction, summarization at high volume. Skip if: workload involves agentic tool-use or you have data-residency requirements that block China-hosted inference.
DeepSeek V3 is the surprise of the comparison. At $0.27/$1.10 per million tokens — about 14x cheaper per output token than Claude Sonnet 4.5 — the expected story is “cheap but bad.” That is not the story the numbers tell. On our 100-prompt eval set, DeepSeek scored 4.18 on the consensus rubric, against Claude’s 4.62. That is a 10% gap on quality at about 8% of Claude’s cost on a cost-per-usable-output basis ($1.32 versus $16.23). For the workloads where the 10% gap does not matter — classification, extraction, structured generation — DeepSeek is the obvious pick.
The workloads where the 10% gap does matter are code edits and agentic tool-use. DeepSeek’s code-edit acceptance rate is 64% on our 50-PR sample, against Claude’s 87%. Hallucinated-tool-call rate is 5.6%, the highest of any model in the comparison. For teams running production agents, that rate is a blocker.
The other caveat is operational. DeepSeek’s primary inference is hosted in China, which is a non-starter for teams with US or EU data-residency requirements. The model is also available through Together, Fireworks, and other inference providers in US/EU regions, but the headline pricing changes (Together hosts DeepSeek V3 at $0.85/$0.90 per million versus DeepSeek’s direct $0.27/$1.10). The cost win narrows but is still substantial against the premium tier.
Llama 3.3 70B (via Together) — the self-hosted bridge
The most cost-competitive open-weight option served on managed infrastructure. Quality lands below the production threshold on our eval set, but high-volume back-office workloads can absorb the penalty. The right pick for teams planning a path to self-hosted inference.
Buy if: the long-term plan involves running inference on your own hardware. Skip if: you need 4.4+ consensus quality in a single shot.
Llama 3.3 70B Instruct on Together is the cheapest option in the comparison on a cost-per-usable basis ($0.88/$0.88 per million across input and output), but the consensus quality score of 3.94 puts it just under the production threshold on our evaluation set. The use case is not “the cheapest premium model” — it is “the model that bridges to self-hosted inference.”
For teams that intend to eventually run inference on their own H100s or rent GPU capacity outside the major inference providers, starting on Llama 3.3 70B at Together gives the team a portability path. The same model weights run on Together, Fireworks, Anyscale, vLLM on rented GPUs, and on-prem hardware. None of the closed-model competitors have that portability. The cost penalty relative to DeepSeek’s cheaper headline input rate is the price of optionality.
The other workloads where Llama 3.3 70B fits: anywhere the output is reviewed downstream by a human or a stronger model. Bulk classification of a 10M-row dataset where the borderline cases get re-classified by a stronger model is the canonical fit. So is RAG retrieval ranking, where the model is one stage in a pipeline rather than the single answer.
Mistral Large — the European safe pick
A solid middle-tier choice with EU data-residency guarantees that the US providers cannot match. Pricing is mid-range. Quality is mid-range. The reason to pick it is the EU jurisdiction, not the cost or quality.
Buy if: EU data residency is a hard requirement (regulated industry, GDPR-conscious enterprise). Skip if: residency is flexible and cost or quality is the binding constraint.
Mistral Large at $2.00/$6.00 per million tokens lands in the middle of the premium tier on cost and the middle on quality (4.21 consensus). On a pure cost-quality basis, there is no axis where Mistral Large is the leader. The reason teams pick it is jurisdiction: Mistral is a French company with EU-hosted inference, and the model meets a class of compliance requirements that US providers cannot.
For teams without that constraint, Gemini 2.5 Pro is the better pick at the same input cost and lower output cost. For teams with the constraint, Mistral is the only mainstream model that meets it without the operational lift of running open-weight models on EU-hosted GPU infrastructure.
05 Per-task winners
| Task type | Winner | Why | Runner-up |
|---|---|---|---|
| Structured classification (high volume) | DeepSeek V3 | 14x cost win at 92% of Claude’s quality | Gemini 2.5 Pro (if jurisdiction blocks DeepSeek) |
| Information extraction (named entities, dates, amounts) | DeepSeek V3 | Same logic. Schema-extraction reliability sits at 94% (Claude 98%) | Gemini 2.5 Pro |
| Code edits (single-file PR-style) | Claude Sonnet 4.5 | 87% acceptance rate, 8 points above GPT-5 | GPT-5 (33% cheaper, 8 points behind) |
| Code generation (greenfield from spec) | GPT-5 | HumanEval+ leaderboard, 91.4% pass@1, 33% cheaper than Claude | Claude Sonnet 4.5 |
| Agentic tool-use (production agent loop) | Claude Sonnet 4.5 | 0.9% hallucinated-tool-call rate; the lowest by a wide margin | GPT-5 (3.2%) |
| Long-context coherence (80K+ input) | Claude Sonnet 4.5 | The only model that produces coherent summaries on 200K-token inputs in single-shot | Gemini 2.5 Pro (loses internal consistency past 100K) |
| Summarization (under 8K input, structured output) | Gemini 2.5 Pro | Half the output cost of GPT-5 at competitive quality | DeepSeek V3 (cheaper still, format-compliance lower) |
| Chat / streaming UX (latency-sensitive) | GPT-5 | P50 of 410ms; the lowest in the comparison | Gemini 2.5 Pro |
06 The routing case: when one model is not the answer
The instinct on reading this comparison is to pick one provider and run everything through it. For most production applications under about 5M tokens per month, that instinct is correct. The operational overhead of running a routing layer (model selection logic, per-model SDK clients, observability across providers, fallback rules) does not pay for itself below the 5M-token threshold.
Above 5M tokens per month, the cost gap between providers becomes large enough that routing pays. A representative production workload at our test scale (15M tokens/month split across classification, code-edit, and summarization) runs $1,800/month against Claude Sonnet 4.5 as a single-provider default. The same workload routed (DeepSeek for classification, Claude for code-edit, Gemini for summarization) runs $560/month. The routing infrastructure (a thin wrapper plus per-model fallback) costs about $300/month of engineering time amortized over a year. Net savings: about $940/month or 52%.
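The same arithmetic, written out so the volumes and overhead can be swapped for your own; all three dollar figures are the illustrative ones quoted above.

```python
single_provider = 1800   # $/month, ~15M tokens all on Claude Sonnet 4.5
routed = 560             # $/month, DeepSeek + Claude + Gemini split by workload
routing_overhead = 300   # $/month, amortized engineering cost of the routing layer

net_savings = single_provider - (routed + routing_overhead)
print(net_savings)                             # 940
print(f"{net_savings / single_provider:.0%}")  # 52%
```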
The two routing patterns worth knowing:
- Workload routing (the high-leverage version): the application classifies the incoming request into a workload type, selects the model that wins that workload type, and falls back to a premium model on quality-check failure.
- Cascade routing (the simpler version): every request goes to a cheap model first, and only escalates to a premium model if the cheap model’s confidence is below threshold. This works well for classification and extraction; less well for code edits and agentic loops.
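A minimal sketch of both patterns, assuming hypothetical `classify_workload()`, `call_model()`, `passes_quality_check()`, and `confidence()` helpers wired to your own provider clients; the model ids and the confidence threshold are illustrative.

```python
# Hypothetical helpers -- wire these to your own classifier, provider clients, and checks.
def classify_workload(request: str) -> str: ...
def call_model(model: str, request: str) -> str: ...
def passes_quality_check(answer: str) -> bool: ...
def confidence(answer: str) -> float: ...

WORKLOAD_WINNERS = {
    "classification": "deepseek-v3",
    "extraction": "deepseek-v3",
    "code_edit": "claude-sonnet-4.5",
    "summarization": "gemini-2.5-pro",
}
PREMIUM_FALLBACK = "claude-sonnet-4.5"
CHEAP_DEFAULT = "deepseek-v3"

def route_by_workload(request: str) -> str:
    """Workload routing: classify the request, call that workload's winner,
    escalate to the premium model if the answer fails the quality check."""
    workload = classify_workload(request)
    model = WORKLOAD_WINNERS.get(workload, PREMIUM_FALLBACK)
    answer = call_model(model, request)
    if not passes_quality_check(answer):
        answer = call_model(PREMIUM_FALLBACK, request)
    return answer

def route_by_cascade(request: str, threshold: float = 0.8) -> str:
    """Cascade routing: always try the cheap model first, escalate on low confidence."""
    answer = call_model(CHEAP_DEFAULT, request)
    if confidence(answer) >= threshold:
        return answer
    return call_model(PREMIUM_FALLBACK, request)
```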
07 Picking by workload — the decision tree
Which model for which workload
- Workload is high-volume classification or extraction? → DeepSeek V3 (if no jurisdiction block) or Gemini 2.5 Pro.
- Workload is code edits on existing files? → Claude Sonnet 4.5.
- Workload is greenfield code generation from a spec? → GPT-5 (cheaper than Claude, 91.4% pass@1 on HumanEval+).
- Workload is agentic tool-use in production? → Claude Sonnet 4.5.
- Workload is long-context (above 80K input tokens)? → Claude Sonnet 4.5.
- Workload is chat with latency budget under 500ms P50? → GPT-5.
- Workload is summarization under 8K input? → Gemini 2.5 Pro.
- Workload requires EU data residency? → Mistral Large.
- You plan to migrate to self-hosted inference later? → Llama 3.3 70B on Together as the bridge.
- You are below 5M tokens/month? → Pick the model that wins your primary workload. Routing is not worth the operational lift at this volume.
- You are above 5M tokens/month? → Route by workload. Net savings are typically 40-60%.
08 What this comparison does not cover
Three things deliberately left out, with the reasoning:
- Image and multi-modal capability. All six models support image input to varying degrees. The cost-quality frontier on image work is different from the text frontier and warrants its own comparison piece.
- Fine-tuning economics. Several providers offer fine-tuning paths that change the cost picture for specific workloads. The piece that compares fine-tuned cost-per-usable across providers is on the schedule.
- Custom inference deployments. Teams running Llama 3.3 on rented H100 capacity or on-prem hardware can hit per-token economics below Together’s headline. Those deployments are infrastructure projects, not API picks, and live in a different cluster of coverage.
09 FAQ
Is the consensus quality score reproducible?
Partially. The 100-prompt eval set is held private to prevent training contamination, but the methodology is documented: cross-model LLM-as-judge with a 1-5 rubric, no model scoring its own output, five judges per output. Teams that want to run the same methodology on their own task distribution can do so with any private prompt set. The directional findings (DeepSeek wins classification, Claude wins code-edits, GPT-5 wins latency) replicate on every internal benchmark we have seen across the year.
Why not include Claude Opus 4 or GPT-5 mini?
Different cost-quality tiers. Opus competes for premium-only workloads where the per-token cost is acceptable; GPT-5 mini competes for high-volume chat, where its cost-quality point sits much further down the cheap end. Mixing tiers in one comparison would obscure the cost-per-usable numbers. The premium-tier and mini-tier comparisons live in separate pieces.
Does DeepSeek’s training data make it unsafe to use commercially?
The training data question is legitimate. DeepSeek’s published training-data sources are less transparent than the major US providers’, and the model has been reported to exhibit alignment patterns consistent with Chinese-government content guidelines on politically sensitive topics. For most production workloads (classification, extraction, internal-app summarization), this is not a blocker. For consumer-facing applications where outputs could surface political content, it can be. The decision is application-specific.
How often do these pricing numbers change?
The premium-tier providers have re-priced roughly twice per year. The open-weight inference providers re-price more often (monthly is not unusual at Together). The rank order in the cost-per-usable column has been stable across the last two rounds of pricing changes, but the absolute numbers shift. WikiWalls refreshes this comparison on a quarterly cadence; the dateModified in the footer is the authoritative review date.
What about provider 5xx and rate-limit reliability?
Over the 30-day test window, Claude streaming uptime was 99.96%, GPT-5 was 99.91%, Gemini 2.5 Pro was 99.85%, DeepSeek (direct) was 99.62%, DeepSeek-via-Together was 99.94%, Llama 3.3 70B on Together was 99.97%, Mistral Large was 99.89%. Production traffic should retry on 5xx and 429 across all providers; the differences in uptime are not large enough to be a deciding factor in model selection.
Why is “latency” measured at P50 instead of P99?
P50 is the right number for chat and streaming UX (the user perceives the typical experience). P99 is the right number for batch and reliability budgets. We measured P95 on the same set: Claude 1,820ms, GPT-5 1,140ms, Gemini 2.5 Pro 1,290ms, DeepSeek 1,560ms, Llama 3.3 70B 1,080ms, Mistral Large 1,610ms. The rank order is similar to P50, but the absolute numbers are 2-3x higher. For latency-bound production workloads, the right test is to run the model under your own production load and measure your own P95.
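If you run that test, the percentile math is a few lines of standard library; the sample values here are made up.

```python
from statistics import quantiles

# Made-up latency samples in milliseconds from your own load test.
samples = [390, 420, 405, 510, 880, 430, 415, 1220, 470, 400]

cuts = quantiles(samples, n=100)   # 99 cut points
p50, p95 = cuts[49], cuts[94]
print(p50, p95)
```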
Last reviewed by WikiWalls editorial. Recommendations are editorially independent. Methodology: /test-methodology/#ai. Editorial standards: /editorial-standards/.