Article Issue #5195

Prompt Caching

What to know

Prompt Caching is an inference optimization where the computed attention KV cache for a stable, reused portion of a prompt is stored server-side, allowing the model to skip recomputing those layers for subsequent requests that begin with the same prefix; When a request is sent with a cache breakpoint marker, the provider computes and stores the KV cache up to that point; Prompt caching is among the highest-ROI optimizations available for production LLM applications that make many calls with a common large prefix

Wikiwalls Team Administrator

May 15, 2026 2 min read

« Back to Glossary Index

Prompt Caching is an inference optimization where the computed attention KV cache for a stable, reused portion of a prompt is stored server-side, allowing the model to skip recomputing those layers for subsequent requests that begin with the same prefix. Providers like Anthropic and OpenAI offer prompt caching APIs that deliver cost discounts of up to 90 percent on cached input tokens.

How it works

When a request is sent with a cache breakpoint marker, the provider computes and stores the KV cache up to that point. On subsequent requests with the same prefix up to the breakpoint, the provider loads the cached KV state and only computes attention for the new tokens that follow the breakpoint. The cache is keyed by the exact token sequence of the prefix; any change invalidates it.

Key facts

Anthropic pricing: Cached input tokens are priced at roughly 10 percent of the standard input token rate for Claude models.
TTL: Anthropic’s prompt cache has a 5-minute TTL by default; entries must be refreshed by re-requesting within that window.
Latency benefit: Cached prefixes also reduce time-to-first-token because fewer attention computations are required.
Best candidates: Static system prompts, large document contexts, tool schemas, and conversation history are prime caching targets.

For builders

Prompt caching is among the highest-ROI optimizations available for production LLM applications that make many calls with a common large prefix. For document analysis products that append the same system prompt and document to many different user questions, caching can reduce costs by 70 to 90 percent. Structure prompts with stable content at the top and variable content at the bottom to maximize cache hit rates.

Sources

« Back to Definition Index

If this saved you an afternoon — and we will send the next one straight to your inbox.

Wikiwalls Team

Administrator · 41 published guides · Joined 2016

Welcome to wikiwalls

How it works

Key facts

For builders

Sources

More from WikiWalls

Cursor vs Copilot vs Cody vs Windsurf, after a 30-day production diary

The Cheapest Production-Grade LLM, ranked at constant output quality

Best Mini-PC for Homelab: Beelink, Minisforum, GMKtec Tested

Best AI Note Apps: Mem vs Reflect vs Tana vs Saner.ai

One careful fix in your inbox each Wednesday.