Prompt Caching
Prompt Caching is an inference optimization where the computed attention KV cache for a stable, reused portion of a prompt is stored server-side, allowing the model to skip recomputing those layers for subsequent requests that begin with the same prefix; When a request is sent with a cache breakpoint marker, the provider computes and stores the KV cache up to that point; Prompt caching is among the highest-ROI optimizations available for production LLM applications that make many calls with a common large prefix
Prompt Caching is an inference optimization where the computed attention KV cache for a stable, reused portion of a prompt is stored server-side, allowing the model to skip recomputing those layers for subsequent requests that begin with the same prefix. Providers like Anthropic and OpenAI offer prompt caching APIs that deliver cost discounts of up to 90 percent on cached input tokens.
How it works
When a request is sent with a cache breakpoint marker, the provider computes and stores the KV cache up to that point. On subsequent requests with the same prefix up to the breakpoint, the provider loads the cached KV state and only computes attention for the new tokens that follow the breakpoint. The cache is keyed by the exact token sequence of the prefix; any change invalidates it.
Key facts
- Anthropic pricing: Cached input tokens are priced at roughly 10 percent of the standard input token rate for Claude models.
- TTL: Anthropic’s prompt cache has a 5-minute TTL by default; entries must be refreshed by re-requesting within that window.
- Latency benefit: Cached prefixes also reduce time-to-first-token because fewer attention computations are required.
- Best candidates: Static system prompts, large document contexts, tool schemas, and conversation history are prime caching targets.
For builders
Prompt caching is among the highest-ROI optimizations available for production LLM applications that make many calls with a common large prefix. For document analysis products that append the same system prompt and document to many different user questions, caching can reduce costs by 70 to 90 percent. Structure prompts with stable content at the top and variable content at the bottom to maximize cache hit rates.
Sources
- Zhao, W. X., et al. (2023). A Survey of Large Language Models. arXiv:2303.18223. arxiv.org
- Yu, G., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI. usenix.org
- Kwon, W., et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. vllm-project. github.com
- Anthropic. Claude API documentation. docs.anthropic.com
- OpenAI. API reference. platform.openai.com