Last month I was watching the token counter climb on a multi-turn agent session and I noticed something that made me stop and open the API response object. The agent had been running for maybe eight minutes. It was re-reading an 80,000-token context on every single turn. Tool definitions, system prompt, the full conversation history, every tool result, all of it. Fresh every time.
I checked the cache_read_input_tokens field in the response. It was zero. Every turn.
Prompt caching was supposed to be on. I had read the docs. I thought it was working. It was not.
That is not a caching problem. It is a misunderstanding of how prompt caching actually works. Once I understood the mechanics, the fix was two lines of code and the next session cost about one-tenth of what the previous one did. This post is what I wish I had read before that session.
Table of Contents#
- What the KV cache actually is
- Prefix matching is physics, not policy
- Two ways to use caching
- Minimum token thresholds
- Cache lifetime and TTL options
- The pricing math
- Reading the API response
- What actually invalidates your cache
- Static-first prompt architecture
- The real economics
- FAQ
See the numbers first: the Prompt Caching Demo shows cache_write, cache_read, and hit rate updating in real time as you chat. Ask a question, then ask a follow-up. The KV cache mechanics this article explains become concrete once you see the metrics move.
What the KV Cache Actually Is#
When a transformer processes a sequence of tokens, the attention mechanism computes a Key matrix and a Value matrix for each token. These KV tensors live in GPU VRAM. They are expensive to compute and they encode every token's relationship to every preceding token.
Because transformer attention is autoregressive, token N's values depend on tokens 0 through N-1. That dependency is not a design choice. It is the fundamental structure of how attention works. You cannot recompute token N in isolation. It needs its full prefix.
Prompt caching stores those precomputed KV tensors. When a new request arrives with the same prefix, the model does not recompute the attention for those tokens. It reads the KV matrices from cache, which is about ten times cheaper than computing them fresh.
This is why the whole system is called a KV cache, not a "prompt cache" or a "text cache." It is not storing your prompt text. It is storing intermediate computation artifacts that let the GPU skip work.

Prefix Matching Is Physics, Not Policy#
The constraint that shapes everything about prompt caching is this: the cache key is cumulative. Every block's hash depends on all content before it.
A single extra space anywhere in the prefix creates a completely new cache entry. A trailing newline that appears on one request and not the next invalidates the cache. This is not Anthropic being strict. It follows directly from the dependency structure above. Token N's KV values depend on tokens 0 through N-1. Change any of those earlier tokens and all the KV values after that point are wrong. The cache has to be invalidated.
The serialization order Anthropic uses is: tools first, then system prompt, then messages. This order matters because it determines where cache invalidations cascade from. If you modify your tool definitions, you blow the cache for the system prompt and all messages too. If you only modify the system prompt, the tools cache survives but the messages cache does not. If you only append a new message, everything before it is preserved.
This hierarchy is why the advice "keep your system prompt stable" exists. It is not just a performance tip. The system prompt sits above messages in the serialization order. Every character change there invalidates everything downstream.
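A toy model of the cumulative keying makes the cascade concrete. This is a sketch, not Anthropic's actual hashing scheme: a single running hash stands in for the KV state, so any edit changes every position at or after it.

```python
import hashlib

def prefix_hashes(tokens):
    """Hash each position cumulatively: entry i depends on tokens 0..i."""
    h = hashlib.sha256()
    out = []
    for tok in tokens:
        h.update(tok.encode())
        out.append(h.hexdigest())
    return out

a = prefix_hashes(["tools", "system", "msg1", "msg2"])
b = prefix_hashes(["tools", "SYSTEM", "msg1", "msg2"])  # system prompt changed

# Positions at or after the change no longer match: everything
# downstream of the edit is a cache miss.
matches = [x == y for x, y in zip(a, b)]
print(matches)  # [True, False, False, False]
```

Swap the change into the tools entry instead and all four positions diverge, which is exactly why tool edits are the most expensive kind.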
Two Ways to Use Caching#
Anthropic gives you two modes.
Automatic caching is turned on by passing a cache_control field at the top level of the request body rather than on specific content blocks. The system finds the last cacheable block in the serialized request and places the cache breakpoint there. As the conversation grows and new messages are appended, the breakpoint advances automatically. You do not have to manage it.
This is the right starting point. It handles the common case well: you have a long, stable prefix (tools plus system prompt) and a growing message history, and you want the model to avoid recomputing the stable part.
Explicit per-block breakpoints give you fine-grained control. You place {"type": "ephemeral"} on specific content blocks, up to four per request. The system caches up to each marked block independently. This is useful when you have multiple stable sections at different depths, for example a large system prompt with a long tools list and a separately stable conversation segment from earlier in the session.
One thing worth noting: when you combine automatic caching with explicit block breakpoints, automatic caching uses one of your four available breakpoint slots. So in practice you have three explicit slots left.
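A request with explicit breakpoints might look like the following. This is a sketch following the Messages API request shape; the model name, tool schema, and prompt text are placeholders.

```python
# Sketch of a Messages API request body with two explicit cache
# breakpoints: one after the tools, one after the system prompt.
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [
        # ... stable tool definitions ...
        {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "input_schema": {"type": "object", "properties": {}},
            # Breakpoint 1: cache everything up to and including the tools.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant. <long stable instructions>",
            # Breakpoint 2: cache tools + system prompt together.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
    ],
}
```

Each breakpoint caches the prefix up to and including the marked block, so a system-prompt edit still leaves the tools breakpoint intact.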
Minimum Token Thresholds#
This is where a lot of silent failures happen. If the prefix you want cached is shorter than the minimum threshold for your model, the request succeeds and returns normally, but nothing gets cached. The API does not error. It does not warn you. cache_creation_input_tokens just stays at zero.
The thresholds by model:
| Model | Minimum tokens to cache |
|---|---|
| Sonnet 4.5, Sonnet 4, Sonnet 3.7, Opus 4, Opus 4.1 | 1,024 |
| Sonnet 4.6 | 2,048 |
| Opus 4.5, Opus 4.6 | 4,096 |
| Haiku 4.5 | 4,096 |
| Haiku 3.5, Haiku 3 | 2,048 |
If you are using Haiku 4.5 with a 2,000-token system prompt and expecting caching to kick in, it will not. You need to hit 4,096 tokens before any caching happens. For short system prompts with small tool sets, you might never cross that threshold, and the cache_control fields will have no effect.
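A small guard can catch this silent failure before it costs you anything. This is a hypothetical helper; the model keys are shorthand for the rows in the table above, not exact API identifiers, and you would plug in your own token counting.

```python
# Minimum cacheable prefix lengths, mirroring the table above.
MIN_CACHEABLE_TOKENS = {
    "sonnet-4.5": 1024, "sonnet-4": 1024, "sonnet-3.7": 1024,
    "opus-4": 1024, "opus-4.1": 1024,
    "sonnet-4.6": 2048, "haiku-3.5": 2048, "haiku-3": 2048,
    "opus-4.5": 4096, "opus-4.6": 4096, "haiku-4.5": 4096,
}

def will_cache(model: str, prefix_tokens: int) -> bool:
    """True if the prefix is long enough for a cache write on this model."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model]

print(will_cache("haiku-4.5", 2000))   # False: request succeeds, nothing cached
print(will_cache("haiku-4.5", 5000))   # True
```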
My session was on an older Opus model where the threshold was 1,024 tokens. That was not the problem. The problem was something else in the prefix that kept changing between requests. More on that in the invalidation section.
Cache Lifetime and TTL Options#
The default TTL for a cache entry is 5 minutes. Every cache hit refreshes the TTL at no extra cost. So as long as your agent is making requests at least every five minutes against the same prefix, the cache stays warm indefinitely.
For sessions with longer gaps between turns, Anthropic offers an extended 1-hour TTL. You opt into it by using {"type": "ephemeral", "ttl": "1h"} instead of {"type": "ephemeral"} on your cache control block.
The extended TTL has a different write price. More on that in the pricing section. The important operational point is that 1-hour TTL is only available on newer models: Opus 4.5 and later, Sonnet 4.5 and later, Haiku 4.5 and later. If you are on an older model and you need longer cache lifetimes, you will need to structure your sessions to stay within 5-minute windows or migrate to a newer model.
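The refresh-on-hit behavior is worth internalizing, because it means the effective lifetime is unbounded for a busy session. A toy model of the 5-minute TTL, with times in seconds:

```python
class CacheEntry:
    """Toy TTL model: every hit resets the expiry clock at no extra cost."""
    def __init__(self, created_at, ttl=300):
        self.ttl = ttl
        self.last_touch = created_at

    def hit(self, now):
        if now - self.last_touch > self.ttl:
            return False          # expired: the next request pays a fresh write
        self.last_touch = now     # TTL counts down from the last hit
        return True

e = CacheEntry(created_at=0)
print(e.hit(240))   # True: within 5 minutes of the write
print(e.hit(480))   # True: within 5 minutes of the previous hit
print(e.hit(900))   # False: 420s gap since the last hit, entry expired
```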
The Pricing Math#
Four relevant operations, four multipliers:
| Operation | Cost multiplier |
|---|---|
| Uncached input (fresh tokens) | 1× base input price |
| 5-minute cache write | 1.25× base input price |
| 1-hour cache write | 2× base input price |
| Cache read (hit) | 0.1× base input price |
A cache read is ten times cheaper than a fresh input token. A 5-minute cache write costs 25% more than processing fresh. So the break-even point for 5-minute TTL is after exactly one cache hit. Write the cache once for 1.25×, read it back once for 0.1×, and you have already paid 1.35× to process tokens that would have cost 2× uncached. On the second read you are deep into savings territory.
For 1-hour TTL the write costs 2×, so you need two cache reads to break even. Beyond that point the savings accrue the same way.
In practice, a multi-turn agent session with a stable prefix will hit the cache dozens of times, and the economics compound quickly. A session with a 100K-token stable prefix on Claude Opus at current pricing costs roughly $15 in input tokens if you process everything fresh every turn. Serve 90% of those tokens from cache and the input bill drops below $3; keep the prefix cached on nearly every turn of a long session and it approaches $1.65. That is not a marginal improvement.
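The break-even arithmetic is easy to check directly. A sketch, assuming an illustrative $15 per million input tokens and the multipliers from the table above:

```python
def prefix_cost(prefix_tokens, turns, write_mult, read_mult=0.1,
                price_per_mtok=15.0):
    """Input cost of the stable prefix: one cache write, then reads."""
    per_tok = price_per_mtok / 1_000_000
    write = prefix_tokens * write_mult * per_tok
    reads = prefix_tokens * read_mult * per_tok * (turns - 1)
    return write + reads

def uncached_prefix_cost(prefix_tokens, turns, price_per_mtok=15.0):
    """Same prefix processed fresh on every turn."""
    return prefix_tokens * turns * price_per_mtok / 1_000_000

# 5-minute TTL breaks even after a single hit: one write plus one read
# (1.35x of a fresh pass) already beats two fresh passes (2x).
cached = prefix_cost(100_000, 2, write_mult=1.25)
fresh = uncached_prefix_cost(100_000, 2)
print(cached < fresh)  # True
```

Raising `turns` shows the compounding: by turn 20 the cached prefix costs a small fraction of the uncached one.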
Reading the API Response#
Every API response includes a usage object that tells you exactly what happened with the cache on that request:
```json
{
  "usage": {
    "input_tokens": 50,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 100000,
    "output_tokens": 393
  }
}
```
Three fields to understand:
cache_creation_input_tokens is how many tokens were written to cache on this request. You will see this as a nonzero number on the first turn of a session (the cold start), when the cache is being populated. After that it should be zero for most turns.
cache_read_input_tokens is how many tokens were read from cache on this request. A healthy session will have this dominating from the second turn onward. If you see this as zero on every turn, your cache is not working.
input_tokens is how many tokens were processed fresh, after the last cache breakpoint. In a healthy session this is a small number: just the new message and any content appended after the last cache boundary.
A healthy turn-two response looks something like: input_tokens: 50, cache_creation_input_tokens: 0, cache_read_input_tokens: 100000. The 100K token prefix came from cache. Only the new 50 tokens were processed fresh.
When I looked at my failing session, every turn showed input_tokens: 80050 and cache_read_input_tokens: 0. The 80K stable prefix was being recomputed every single time.
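That three-way distinction is easy to encode. A minimal classifier over the usage object, using the field names shown above:

```python
def cache_health(usage: dict) -> str:
    """Classify a turn from the usage object: warm, cold start, or miss."""
    wrote = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if read > 0:
        return "warm"          # prefix served from cache
    if wrote > 0:
        return "cold_start"    # first turn, cache being populated
    return "miss"              # nothing cached: threshold or prefix problem

healthy_turn_two = {"input_tokens": 50, "cache_creation_input_tokens": 0,
                    "cache_read_input_tokens": 100_000, "output_tokens": 393}
failing_turn = {"input_tokens": 80_050, "cache_creation_input_tokens": 0,
                "cache_read_input_tokens": 0, "output_tokens": 400}

print(cache_health(healthy_turn_two))  # warm
print(cache_health(failing_turn))      # miss
```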
What Actually Invalidates Your Cache#
Knowing what invalidates the cache is as important as knowing how to set it up.
| Change | Cache effect |
|---|---|
| Tool definitions modified | Entire cache gone (tools are first in serialization order) |
| System prompt text modified | System prompt cache and all messages cache gone |
| New message appended | Only the new message is uncached, prefix is preserved |
| tool_choice parameter changed | Messages cache invalidated, tools and system are fine |
| Image added or removed from messages | Messages cache invalidated, prefix is preserved |
The pattern is clear: changes higher in the serialization hierarchy are more expensive than changes lower down. Modifying a tool definition is the most costly change you can make because it invalidates everything. Appending a new message at the end is the cheapest because it preserves everything above it.
My session failure turned out to be a timestamp embedded in the tool definitions. A current_time field that was being injected fresh into the tool schema on every request. One small string change per turn, top of the hierarchy, complete cache miss every time. Removing that field and passing the current time in the message instead fixed it immediately.
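The failure mode is easy to reproduce in miniature. Below, a hash of the serialized prefix stands in for the cache key (a toy, not Anthropic's actual keying), and a timestamp injected into the tool schema guarantees a fresh key on every "request":

```python
import datetime
import hashlib
import json
import time

def prefix_key(tools, system):
    """Toy stand-in for the cache key: hash of the serialized prefix."""
    blob = json.dumps({"tools": tools, "system": system}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def build_tools(inject_timestamp):
    tool = {"name": "search", "description": "Search the knowledge base."}
    if inject_timestamp:
        # The bug: a per-request value at the top of the hierarchy.
        tool["current_time"] = datetime.datetime.now().isoformat()
    return [tool]

system = "You are a research assistant."

buggy_1 = prefix_key(build_tools(True), system)
time.sleep(0.01)  # the clock ticks between "requests"
buggy_2 = prefix_key(build_tools(True), system)
print(buggy_1 == buggy_2)   # False: new prefix every request, zero hits

fixed_1 = prefix_key(build_tools(False), system)
fixed_2 = prefix_key(build_tools(False), system)
print(fixed_1 == fixed_2)   # True: stable prefix, the cache can hit
```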
Static-First Prompt Architecture#
This is the practical implication of everything above.
Your prompt has a natural hierarchy: tools, then system, then messages. The further up the hierarchy a piece of information lives, the more expensive it is to change. So the design principle follows: put everything that changes frequently as far down the hierarchy as possible.
Tools should almost never change during a session. Their definitions should be fixed at session start and stay fixed. If you need to pass dynamic information, put it in the messages, not the tool schemas.
System prompts should be stable across sessions. Instructions, constraints, persona, formatting rules. These do not need to change per-user or per-request. If you are personalizing your system prompt per user, you are blowing your cache for every unique user. Consider whether that personalization can live in an early message instead.
Messages are where dynamic content belongs. Per-user context, per-request instructions, the actual task at hand. These live at the bottom of the hierarchy and their changes do not cascade upward.
This is what Claude Code's engineering team refers to when they describe their internal mantra as "Cache Rules Everything Around Me." Their shared prefix, consisting of around 18,000 tokens of system prompt plus tools plus a globally shared CLAUDE.md, is cached once and served to all users. They run monitoring on cache hit rate and treat drops below 80% as incidents worth escalating. The shared prefix never changes during a deployment.
If you are building anything that resembles a long-running AI agent harness, this kind of static-first thinking is not optional. You are making potentially hundreds of API calls per session. The cost difference between a well-cached prefix and an uncached one is the difference between a viable product and one that bleeds money at scale.
The Real Economics#
Let me be specific about what "90% cache hit rate" actually means in dollar terms, because the abstraction of "10x cheaper reads" can be hard to feel until you see actual numbers.
Claude Opus charges at a certain per-token rate for input. A 100,000-token prefix per turn, completely uncached, across a 20-turn session, means you are paying for 2,000,000 input tokens just in prefix repetition. That is a lot of money for work the model has already done.
With caching set up correctly, turns 2 through 20 read those 100K tokens from cache at 0.1x cost. You pay 1.25x once for the initial cache write, then 0.1x for each of the 19 reads. The prefix cost for the same session drops to roughly 16% of the uncached figure.
Output tokens are unchanged. The new message tokens each turn are unchanged. Only the stable prefix cost collapses.
This is also why the cost structure of agentic applications is so different from traditional software. The same agent running 10 sessions a day at 20 turns each accumulates 200 turns of prefix-heavy API calls. The cache hit rate across that volume is what determines whether the product is economically viable or not. It is not a nice-to-have optimization. It is load-bearing.
For teams building multi-agent systems, the math gets even more interesting. Each agent in an orchestrated workflow has its own context window and its own caching behavior. An orchestrator with a large system prompt that spins up subagents is potentially making dozens of API calls, each of which could benefit from a cached prefix or be burning money unnecessarily. Getting caching right across the whole system requires thinking about each agent's prefix stability independently.
What I Actually Changed#
For completeness, the fix involved three changes.
First, I removed the dynamic timestamp from the tool definition schema and passed it in the first user message instead.
Second, I added explicit cache_control: {"type": "ephemeral"} to the last system prompt block and to the last element of the tools array.
Third, I added logging of cache_read_input_tokens on every API response, so I could see in real time whether the cache was warm. The first turn of a new session now shows cache_creation_input_tokens > 0. Every subsequent turn shows cache_read_input_tokens > 0. If I ever see two consecutive turns with zero cache reads, the log alerts me to investigate.
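A minimal version of that alerting logic looks like this. It is a sketch; the two-consecutive-turns threshold is a judgment call, and in a real harness you would log rather than print.

```python
class CacheMonitor:
    """Track cache reads per turn; flag two consecutive zero-read turns."""
    def __init__(self):
        self.zero_streak = 0

    def record(self, usage: dict) -> bool:
        """Feed each response's usage object; returns True when unhealthy."""
        if usage.get("cache_read_input_tokens", 0) == 0:
            self.zero_streak += 1
        else:
            self.zero_streak = 0
        return self.zero_streak >= 2

mon = CacheMonitor()
mon.record({"cache_read_input_tokens": 0})          # turn 1: cold start, fine
alert = mon.record({"cache_read_input_tokens": 0})  # turn 2: still cold
print(alert)  # True: time to investigate the prefix
```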
It took about 30 minutes to implement. The next session that ran the same workflow cost about $1.40 instead of $12.
The five pillars of agentic engineering covers this kind of cost-aware architecture as part of the broader discipline of building agentic systems that actually work in production. Prompt caching is one tool in that discipline. But it is a particularly high-leverage one because the gains are immediate and measurable.
FAQ#
Does prompt caching work across different users, or only within one session?
The cache is keyed by the exact byte sequence of the prefix. If two different users send requests with identical prefixes (same tool definitions, same system prompt, same message history up to that point), they will both benefit from the same cached KV tensors. This is how Claude Code's engineering team gets value from a globally shared prefix across all users. The cache is not session-scoped at the API level. It is content-scoped. Identical prefixes share cache entries regardless of which user or session sent the request.
What happens if my prompt is below the minimum token threshold?
The request succeeds and the response comes back normally. Nothing breaks. But cache_creation_input_tokens will be zero and no cache entry is created. The API does not warn you. This is one of the most common reasons people think caching is working when it is not. Check the threshold for your specific model and make sure your prefix exceeds it. The minimum is 1,024 tokens for most Sonnet and Opus variants, 2,048 for Sonnet 4.6 and Haiku 3.5, and 4,096 for Haiku 4.5 and newer Opus models.
How does the 5-minute TTL affect long-running agent sessions?
The TTL refreshes on every cache hit at no extra cost. So as long as your agent is making at least one request every five minutes, the cache stays warm indefinitely. The TTL only starts counting down from the last hit, not from the initial write. For sessions with longer pauses between turns, the 1-hour TTL option (available on Sonnet 4.5+, Opus 4.5+, Haiku 4.5+) makes more sense, though it costs 2x base input price to write instead of 1.25x. For very infrequent access patterns, you may need to accept the occasional cold-start cost.
Does switching models within a session affect the cache?
Yes. Cache entries are model-specific. The KV tensors computed by Claude Sonnet 4.5 are not compatible with those computed by Claude Opus 4.1. If you switch models mid-session, you pay full write cost again to populate the cache for the new model. In practice, most sessions use a single model throughout, so this does not come up often. But if you are building a system that routes to different models based on task complexity, be aware that every model switch is a cache cold start.
Can I check if my prompts are actually being cached?
Yes, and you should always do this. The usage field in every API response includes cache_creation_input_tokens and cache_read_input_tokens. If you are setting up caching and both fields are zero on turn two of a session, something is wrong. Either the prefix is below the minimum threshold, the prefix changed between requests, or your cache_control block is not correctly placed. Logging these two fields on every request is the simplest way to monitor cache health. You can also compute a running cache hit rate and alert when it drops.
Does prompt caching work with streaming responses?
Yes. Streaming affects how the output tokens are delivered to your client but it does not change how the input tokens are processed on the server side. The cache is populated and read during the input token processing phase, before any output is generated. So the same caching behavior applies whether you are using streaming or not. Your usage object will still be accurate, though with streaming it typically appears at the end of the stream rather than in the initial response body.