Cost Per Token
Cost Per Token is the pricing unit for language model API usage, typically expressed in US dollars per million tokens and split into separate input (prompt) and output (completion) rates. Output tokens are generated one at a time, while input tokens are processed in parallel, so output generation is more compute-intensive and output tokens are generally priced 3 to 5 times higher than input tokens for the same model.
How it works
A provider tokenizes each request, counts input tokens (everything in the prompt, including the system message, conversation history, and retrieved context), and counts output tokens (everything the model generates in response). Total cost equals (input tokens / 1,000,000) times input price plus (output tokens / 1,000,000) times output price. Where a provider supports prompt caching, input tokens served from the cache are billed at the lower cached rate instead of the standard input rate.
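The calculation above can be sketched in Python. This is a minimal illustration, not any provider's billing logic: the `Pricing` class, the `request_cost` function, and the example prices are hypothetical, and the sketch assumes that cached tokens are reported as a subset of the input token count.

```python
from dataclasses import dataclass
from typing import Optional

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price sheet, in USD per million tokens."""
    input_per_million: float
    output_per_million: float
    # None means the model has no separate cached-input rate.
    cached_input_per_million: Optional[float] = None


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0) -> float:
    """Return the cost in USD of one request.

    Assumption: cached_input_tokens is a subset of input_tokens.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")

    # Fall back to the standard input rate when no cached rate exists.
    cached_rate = pricing.cached_input_per_million
    if cached_rate is None:
        cached_rate = pricing.input_per_million

    uncached_tokens = input_tokens - cached_input_tokens
    total = (uncached_tokens * pricing.input_per_million
             + cached_input_tokens * cached_rate
             + output_tokens * pricing.output_per_million)
    return total / TOKENS_PER_MILLION


# Illustrative prices only; check the provider's current price list.
example = Pricing(input_per_million=3.00, output_per_million=15.00,
                  cached_input_per_million=0.30)
cost = request_cost(example, input_tokens=2_000, output_tokens=500)
print(f"{cost:.4f} USD")
```

Passing `cached_input_tokens=1_500` to the same call shows how much a cache hit on most of the prompt lowers the input portion of the bill.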
Key facts
- Typical ranges (2025): Frontier models cost roughly 1 to 15 USD per million input tokens; hosted open-weight models cost roughly 0.1 to 1 USD per million input tokens.
- Output premium: Output tokens typically cost 3 to 5 times more than input tokens for the same model.
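- Context window impact: Sending a large context on every request quickly makes input tokens the dominant share of total cost; prompt caching lowers the price of repeated context, and retrieval-augmented generation (RAG) reduces it by sending only the relevant passages.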
- Volume discounts: Enterprise agreements and committed usage tiers unlock significant discounts below list pricing.
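To illustrate the context window point, the sketch below computes what fraction of a request's cost comes from input tokens. The function name `input_cost_share`, the token counts, and the prices are assumptions chosen for illustration, not measurements.

```python
def input_cost_share(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Return the fraction (0 to 1) of request cost due to input tokens.

    Prices are in USD per million tokens.
    """
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    total = input_cost + output_cost
    if total == 0:
        raise ValueError("request has zero cost")
    return input_cost / total


# Hypothetical request: a short prompt versus a large retrieved context,
# both producing a 300-token answer at 3 USD input / 15 USD output.
short_prompt = input_cost_share(500, 300, 3.00, 15.00)
large_context = input_cost_share(20_000, 300, 3.00, 15.00)
print(f"short prompt:  {short_prompt:.0%} of cost is input")
print(f"large context: {large_context:.0%} of cost is input")
```

Under these assumed numbers the input share rises sharply once the context grows, even though output tokens carry the higher per-token price.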
For builders
Unit economics analysis is essential before scaling any LLM feature. The standard exercise is to estimate average input and output token counts per request, multiply by expected call volume, and project monthly cost at list prices. Prompt caching, model tiering (routing simpler tasks to cheaper models), and output length constraints are the primary levers for reducing cost once the baseline is established.
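The baseline exercise and the model tiering lever can be sketched as follows. All names (`ModelPrice`, `monthly_cost`, `tiered_monthly_cost`), the traffic figures, and the prices are hypothetical, and the sketch assumes, as a simplification, that requests routed to the cheaper model have the same average token counts as the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical list prices, in USD per million tokens."""
    input_per_million: float
    output_per_million: float


def monthly_cost(price: ModelPrice, avg_input_tokens: float,
                 avg_output_tokens: float, requests_per_month: int) -> float:
    """Project monthly cost in USD from per-request averages."""
    per_request = (avg_input_tokens / 1_000_000 * price.input_per_million
                   + avg_output_tokens / 1_000_000 * price.output_per_million)
    return per_request * requests_per_month


def tiered_monthly_cost(large: ModelPrice, small: ModelPrice,
                        share_to_small: float, avg_input_tokens: float,
                        avg_output_tokens: float,
                        requests_per_month: int) -> float:
    """Project monthly cost when a share of traffic goes to a cheaper model."""
    if not 0.0 <= share_to_small <= 1.0:
        raise ValueError("share_to_small must be between 0 and 1")
    small_requests = round(requests_per_month * share_to_small)
    large_requests = requests_per_month - small_requests
    return (monthly_cost(large, avg_input_tokens, avg_output_tokens, large_requests)
            + monthly_cost(small, avg_input_tokens, avg_output_tokens, small_requests))


# Illustrative inputs only: 1,500 input and 300 output tokens per request,
# one million requests per month.
frontier = ModelPrice(input_per_million=3.00, output_per_million=15.00)
budget = ModelPrice(input_per_million=0.25, output_per_million=1.25)

baseline = monthly_cost(frontier, 1_500, 300, 1_000_000)
tiered = tiered_monthly_cost(frontier, budget, 0.6, 1_500, 300, 1_000_000)
print(f"baseline: {baseline:,.0f} USD per month")
print(f"tiered:   {tiered:,.0f} USD per month")
```

Re-running the projection with a lower `avg_output_tokens` gives a rough view of what an output length constraint would save; savings from prompt caching depend on the provider's cached rate and the cache hit ratio.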
Sources
- Zhao, W. X., et al. (2023). A Survey of Large Language Models. arXiv:2303.18223. arxiv.org
- Yu, G., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI. usenix.org
- Kwon, W., et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. vllm-project. github.com
- Anthropic. Claude API documentation. docs.anthropic.com
- OpenAI. API reference. platform.openai.com