
The Static-First Prompt Architecture

Prompt caching is not a feature you toggle on. It is an architectural constraint. Here is the layered structure and 4-breakpoint strategy that makes it work reliably.

ai · prompt-caching · context-engineering · agentic-engineering · claude-code · llm-costs

I enabled prompt caching on an agent I had been running for a few weeks. Checked the usage dashboard the next day expecting to see cache hits everywhere. What I actually saw was cache_creation_input_tokens spiking on nearly every single turn.

The cache was being rebuilt constantly. Not a few times. Almost every request.

My first assumption was that I had gotten the API call wrong, maybe a missing header or a malformed cache_control block. I spent a while checking that. Everything looked correct. The cache breakpoints were in the right places syntactically. But the hits were not coming.

It took me an embarrassingly long time to figure out what was actually happening. The problem was not the cache configuration. The problem was the prompt structure. I had been injecting dynamic context into the system prompt, updating it each turn with the current time, the current file state, the current git status. And every time I did that, the entire prefix hash changed, and the cache was cold again.

That is the thing about prompt caching architecture that nobody says clearly enough: caching is not a feature you turn on. It is an architectural constraint you design around from the start. If the prompt structure is wrong, the best cache configuration in the world does nothing.

This is the second post in a four-part series on prompt caching. The first post covers what caching is and when it matters. This one is about the canonical layered structure that makes caching work reliably, the four-breakpoint strategy, and the rules that hold across every framework I have worked with.




Try it live: the Prompt Caching Demo lets you chat with Claude and watch cache_write, cache_read, and hit rate update after each response. The Agno and LangGraph implementations from this post are visible in the code panel. Worth opening alongside this article.

The Core Mental Model: The Prompt Is an Append-Only Log#

Here is the rule that changed how I think about this: never modify content that has already been sent.

The system prompt, tool definitions, and any project context you load at session start are written once. They do not change. Everything that comes after, new information, current state, dynamic context, goes into new messages appended at the tail.

Think of the first API call as writing to stone. Everything after that is papyrus appended below. You do not erase the stone to add a date. You write the date on the next piece of papyrus.

This sounds obvious. It is not how most people build. The instinct when you need to pass current file state to the model is to update the system prompt, because that is where you put "things the model needs to know." But the system prompt is not a state container. It is a frozen definition of who the agent is and what it can do. The moment you start mutating it mid-session, the cache is dead.

The distinction matters because the cache key is a hash of everything from the beginning of the prompt up to the breakpoint. Change one character anywhere in that prefix, and you are starting over. The cache does not do partial matches.
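The exact hashing scheme is internal to the provider, but the all-or-nothing behavior is easy to demonstrate with an ordinary hash over the prefix bytes. The `cache_key` helper below is a conceptual stand-in, not the real implementation:

```python
import hashlib

def cache_key(prefix: str) -> str:
    """Conceptual stand-in for the provider's internal prefix hash."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]

stable = "You are a coding assistant.\n[tool schemas...]\n"
key_a = cache_key(stable + "User: fix the failing test")

# Injecting a timestamp into the *prefix* changes every byte that follows it:
mutated = stable.replace("assistant.", "assistant. Time: 14:32")
key_b = cache_key(mutated + "User: fix the failing test")

print(key_a == cache_key(stable + "User: fix the failing test"))  # True: identical prefix, same key
print(key_a == key_b)  # False: one edit upstream, cold cache
```

Identical bytes hash to the identical key on every request; a single upstream edit produces an entirely different key, which is exactly why there is no partial credit.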


The Five Layers of a Well-Structured Prompt#

When you hold the append-only constraint in mind, a natural layered structure falls out. Each layer has a different lifetime and a different role.

Figure: Static-first prompt architecture. Five layers stacked top to bottom: tool definitions, system prompt, project context, and conversation history (all cached), then the current user message (dynamic, never cached), with cache breakpoints marked on the right.

Layer 1: The static system prompt

Frozen at initialization. This is where agent identity lives: behavioral rules, capabilities description, constraints. It never mutates during a session.

In Claude Code's actual implementation, this is approximately 4,000 tokens. It is shared globally across all users of Claude Code, not rebuilt per session. Because it never changes, Anthropic can pre-warm it and maintain it in cache permanently.

Layer 2: Tool definitions

Locked at session start. Full JSON schemas for every tool the agent has access to. This layer matters because of something non-obvious about how Anthropic's internal ordering works: tools are serialized before the system prompt and messages in the cache prefix calculation. The order is tools, then system, then messages. So a cache_control breakpoint placed after the last tool definition caches all tools as a single prefix block.

The practical implication: if you change any tool definition mid-session, you invalidate the entire tool cache, which is usually your largest block. Register all tools at initialization and do not touch them.

Layer 3: Project context

Semi-static. In Claude Code, this is the CLAUDE.md file. It is shared across all sessions within a project but different between projects. This is the ideal location for a third cache breakpoint. It changes rarely, it is substantial in size, and it sits cleanly above the dynamic conversation layer.

Layer 4: Session state and summaries

Per-session context: compaction summaries from previous context windows, retrieved memories, session-specific instructions. This layer changes less frequently than individual messages but more frequently than the project context above it.

Layer 5: Conversation messages

The dynamic tail. Every turn adds to this. Dynamic context like the current time, git status, and open file contents should be injected here via <system-reminder> tags in user messages. Not in the system prompt. Not in layer 3. Here, in the message that is already new, where changing it costs you nothing because it was never part of the cached prefix.

This is the pattern I was doing wrong. I was pulling dynamic state up into the system prompt because I thought of it as "context the model needs." It is context the model needs, yes. But needs does not mean "must be in the system prompt." It means "must be somewhere in the request." Messages work fine. And messages do not break the cache.
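In code, the fix is a small message-builder that wraps per-turn state in `<system-reminder>` tags inside the new user message. This is a sketch under my own naming (`build_user_message` is a hypothetical helper, not an SDK function):

```python
import datetime

def build_user_message(user_text: str, dynamic_state: dict[str, str]) -> dict:
    """Wrap per-turn dynamic state in <system-reminder> tags inside the NEW
    user message, leaving the cached system/tool prefix untouched."""
    reminders = "\n".join(
        f"<system-reminder>{key}: {value}</system-reminder>"
        for key, value in dynamic_state.items()
    )
    return {"role": "user", "content": f"{reminders}\n\n{user_text}"}

msg = build_user_message(
    "Why is the build failing?",
    {
        "current_time": datetime.datetime.now().isoformat(timespec="minutes"),
        "git_status": "M src/app.py",  # a real agent would read `git status --porcelain`
    },
)
```

The system prompt never sees the timestamp or the git status, so the cached prefix survives the turn unchanged.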


The Four-Breakpoint Strategy#

Anthropic allows up to four explicit cache_control breakpoints per request. Four is not a lot, but it is enough if you place them strategically.

The canonical placement:

  1. After the last tool definition, to cache all tool schemas as a unit
  2. After the static system prompt, to cache the system + tools prefix
  3. After project context (CLAUDE.md or equivalent), to cache the full static prefix
  4. After the last user message, or rely on automatic caching for the conversation tail

Here is what that looks like in a raw API request:

```json
{
  "model": "claude-sonnet-4-20250514",
  "tools": [
    {"name": "read_file", "description": "...", "input_schema": {"...": "..."}},
    {"name": "write_file", "description": "...", "input_schema": {"...": "..."}, "cache_control": {"type": "ephemeral"}}
  ],
  "system": [
    {"type": "text", "text": "You are a coding assistant...", "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "[CLAUDE.md contents here]", "cache_control": {"type": "ephemeral"}}
  ],
  "messages": [
    {"role": "user", "content": [
      {"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}
    ]}
  ]
}
```

Note that in messages, cache_control goes on a content block, not on the message object itself, which is why the user content is an array of blocks rather than a plain string.

The cache_control on the last tool definition covers everything up to and including that tool. The first system block covers the static system prompt plus the tools cached before it. The second system block covers the project context. The final one on the user message covers the conversation tail.

One thing worth noting: "ephemeral" is currently the only cache type. It sounds like it means the cache expires quickly, but the actual TTL is five minutes with a clock reset on each access. In a live agent session, the cache for any frequently-used prefix stays warm indefinitely, because every turn resets the clock.
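Because the four-breakpoint budget is easy to blow through when request-building code is spread across modules, I find it worth asserting on before sending. A small guard, sketched here with a hypothetical `count_breakpoints` helper, walks tools, system blocks, and message content blocks:

```python
def count_breakpoints(request: dict) -> int:
    """Count explicit cache_control markers across tools, system, and
    messages. The API allows at most four per request."""
    blocks = list(request.get("tools", [])) + list(request.get("system", []))
    for message in request.get("messages", []):
        content = message.get("content", [])
        if isinstance(content, list):
            blocks.extend(content)
    return sum(1 for b in blocks if isinstance(b, dict) and "cache_control" in b)

request = {
    "tools": [
        {"name": "read_file"},
        {"name": "write_file", "cache_control": {"type": "ephemeral"}},
    ],
    "system": [
        {"type": "text", "text": "You are a coding assistant...", "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": "[project context]", "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "hi", "cache_control": {"type": "ephemeral"}},
        ]},
    ],
}
assert count_breakpoints(request) <= 4
```

Running this in a test suite catches the case where a refactor accidentally adds a fifth marker, which would otherwise surface as an API error at runtime.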


What This Looks Like for Claude Code Specifically#

Claude Code is a useful case study because Anthropic has been transparent about how it works.

The static system prompt is about 4,000 tokens. It is cached globally and shared across all Claude Code users. When you start a session, you never pay full price for that 4,000 tokens because it was already cached.

Tool definitions are locked at session start. All of Claude Code's tools, read_file, write_file, bash, and so on, get cached as a block when the first message is sent.

CLAUDE.md gets loaded as a third breakpoint. If your CLAUDE.md is 2,000 tokens, that plus the system prompt plus the tools form a static prefix of roughly 18,000 tokens. That block is cached and warm for the entire session.

Dynamic state like git status, open file contents, and current working directory comes in via <system-reminder> tags in user messages. If you have used Claude Code and noticed those <system-reminder> annotations in the raw request, that is exactly what they are: a mechanism for injecting dynamic context into the message layer without touching the cached prefix.

The result is 90 to 96 percent cache hit rates across Claude Code sessions. That number is achievable specifically because the architecture was designed from the start to keep the static prefix stable.

If you are building something similar, the question to ask is: what is the smallest possible surface area that needs to change each turn? Everything outside that surface area belongs in the static prefix. Everything inside it belongs in messages.
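To know whether your own agent is anywhere near those hit rates, compute the rate from the usage object the Messages API returns with each response. The field names below are the real ones (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); the sample numbers are illustrative:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache, from the Messages API
    `usage` object on each response."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0

# A healthy mid-session turn: almost everything comes from cache.
usage = {"input_tokens": 400, "cache_creation_input_tokens": 250,
         "cache_read_input_tokens": 18_000}
print(f"{cache_hit_rate(usage):.0%}")
```

If this number is low turn after turn while `cache_creation_input_tokens` stays high, you have the exact symptom from the opening of this post: something in the prefix is mutating.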


The Compaction Fork Pattern#

Context windows fill up. When they do, the instinct is to summarize. But how you summarize matters enormously for caching.

The wrong approach is to take the filled context and run it through a "summarizer" agent with a different system prompt, something like "You are a summarization assistant, condense this conversation." That creates a cold cache entry under a completely different prefix. You pay full input price for the summary, and then when you start the new session with that summary, you are starting cold again.

The right approach is the compaction fork: use the same system prompt and the same tool definitions. Do not change the prefix at all. Append a user message asking for a summary of the conversation so far. Take the response. Then start a new session with the original prefix intact and the summary as the opening user message.

Post-compaction, the cached prefix stays warm. The first turn of the new session hits the cache for everything except the summary message itself. You are paying for maybe 500 tokens of new content, not 18,000 tokens of re-cached prefix.

The principle here is the same as the append-only rule: do not mutate what is already stable. The prefix is stable. The message tail is what changes. Compaction replaces the tail. It leaves the prefix alone.
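The fork is only a few lines of orchestration. This sketch uses a stand-in `call_model` callable with a hypothetical signature; a real agent would call the Messages API there:

```python
def compact(call_model, system, tools, messages):
    """Compaction fork: summarize under the SAME prefix, then restart the tail."""
    summary = call_model(
        system=system,    # unchanged prefix: still cached
        tools=tools,      # unchanged prefix: still cached
        messages=messages + [{
            "role": "user",
            "content": "Summarize the conversation so far for a fresh context window.",
        }],
    )
    # New session: original prefix intact, summary as the opening user message.
    return [{"role": "user", "content": f"Summary of prior session:\n{summary}"}]

# Stubbed model call for illustration only.
fake_call = lambda system, tools, messages: "Refactored auth module; tests passing."
new_tail = compact(fake_call, "You are a coding assistant.", [], [])
```

The anti-pattern would be passing a different `system` string ("You are a summarization assistant...") into the summary call, which hashes to a different prefix and starts cold.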


Seven Cross-Framework Rules#

I have worked with this across a few different frameworks at this point. LangGraph, direct API calls, some lighter wrappers. The following rules hold regardless of what you are building on.

Rule 1: The prompt is an append-only log. Never modify content already sent. This is the foundation everything else rests on.

Rule 2: Tools are immutable for the session's lifetime. Register all tools at initialization. If a tool definition changes, the entire tool cache is cold. The cost of a full cold-start write on 18K tokens is not trivial.

Rule 3: Dynamic context belongs in messages. Current time, file state, git status, environment variables. All of it goes in <system-reminder> tags in user messages. Not in the system prompt. Not in project context. In messages.

Rule 4: Compaction forks the prefix, not the prompt. Summarize using the same system prompt and tools. Replace only the message tail.

Rule 5: Model switching requires subagents. Caches are per-model. Switching from Opus to Haiku mid-session means a full cold-start cache write under the Haiku model ID. If you are 100,000 tokens into an Opus conversation and you decide to switch to Haiku to answer a "simple question," you will pay to write a 100,000-token cache entry under Haiku. That is almost certainly more expensive than having Opus answer the question with its warm cache. If you need a cheaper model for some subtask, spawn a subagent with its own session. Do not switch models mid-session.

Rule 6: Deterministic serialization is mandatory. The cache key is a hash of the prefix content. If your JSON serialization is non-deterministic, the hash will vary between requests even when the logical content is identical. This is a real problem in Swift (Dictionary ordering is not guaranteed) and Go (map iteration order is random). Pin serialization order explicitly. Sort keys. Use an ordered data structure if your language requires it.
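Python dicts are insertion-ordered, so the hazard there is two code paths building the same tool in different field orders. A minimal demonstration of why canonical serialization matters:

```python
import hashlib
import json

tool_a = {"name": "read_file", "input_schema": {"type": "object"}}
tool_b = {"input_schema": {"type": "object"}, "name": "read_file"}  # same content, different insertion order

naive = json.dumps(tool_a), json.dumps(tool_b)                          # preserves insertion order
stable = json.dumps(tool_a, sort_keys=True), json.dumps(tool_b, sort_keys=True)  # canonical bytes

print(naive[0] == naive[1])    # False: logically identical, different bytes
print(stable[0] == stable[1])  # True: sort_keys pins the byte order

prefix_hash = hashlib.sha256(stable[0].encode()).hexdigest()
```

`sort_keys=True` (or your language's equivalent of a canonical encoder) makes logically equal content byte-equal, which is the property the cache key actually depends on.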

Rule 7: The defer-loading pattern for large tool registries. If you have a large set of MCP tools, including all of their full schemas in the tool definitions block makes the block large and inflexible. Instead, register lightweight stubs. Include a ToolSearch meta-tool that returns full schemas as message content when called. The full schemas arrive as conversation content, not as tool definition changes, so they do not break the cache. This is the pattern behind ToolSearch and deferred tool loading in Claude Code.
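A sketch of the defer-loading pattern. Everything here is my own naming (`tool_search`, `stub_tools`, the registry) rather than any framework's API:

```python
import json

# Full schemas live in a registry, NOT in the tool definitions block.
FULL_SCHEMAS = {
    "deploy_service": {"type": "object", "properties": {"env": {"type": "string"}}},
    # ...many more MCP tools...
}

def stub_tools() -> list[dict]:
    """What gets registered at session start: lightweight stubs plus one meta-tool."""
    stubs = [{"name": name, "description": "Call tool_search for the full schema."}
             for name in FULL_SCHEMAS]
    stubs.append({
        "name": "tool_search",
        "description": "Return the full schema for a named tool.",
        "input_schema": {"type": "object", "properties": {"name": {"type": "string"}}},
    })
    return stubs

def tool_search(name: str) -> str:
    """Handler: the schema arrives as MESSAGE content, so the cached tool
    definitions block never changes mid-session."""
    return json.dumps(FULL_SCHEMAS[name], sort_keys=True)
```

The stub list is small and frozen at session start; the expensive schemas flow through the message tail on demand.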


The Cost of Getting This Wrong#

Let me put actual numbers on this, because the abstract argument for good architecture is less convincing than a specific bill.

Assume an 18,000-token static prefix. Tools plus system prompt plus CLAUDE.md. You are running on Claude Opus 4 at $15 per million input tokens.

Every turn without caching, the full prefix costs 18,000 tokens times $0.000015, which is $0.27 in prefix cost per turn. Across a 50-turn session that is $13.50 just for the prefix, not counting output tokens or the dynamic message content.

With caching, the cached prefix costs $0.0000015 per token, which is one-tenth the uncached price. The same 50-turn session costs $1.35 in prefix tokens.

That is $12.15 difference per session. From prompt structure alone. The model output is identical either way. The agent is doing the same work. The only difference is whether the prefix is static enough to cache.
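The arithmetic above as a script. Note this is the steady-state comparison; it ignores the one-time cache-write premium on the first turn, which Anthropic bills at 1.25x the base input rate:

```python
PREFIX_TOKENS = 18_000
TURNS = 50
INPUT_PRICE = 15 / 1_000_000          # $15 per million input tokens (Opus-class)
CACHE_READ_PRICE = INPUT_PRICE * 0.1  # cache reads bill at one-tenth the input rate

uncached = PREFIX_TOKENS * INPUT_PRICE * TURNS
cached = PREFIX_TOKENS * CACHE_READ_PRICE * TURNS

print(f"uncached: ${uncached:.2f}")
print(f"cached:   ${cached:.2f}")
print(f"saved:    ${uncached - cached:.2f}")
```

Plug in your own prefix size and turn count; the savings scale linearly with both.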

At any real scale, that difference is not a rounding error. It is the argument for redesigning the architecture.

If you are working through the broader cost dynamics of agentic systems, the post on why agentic apps cost so little to build and so much to run goes deeper on the token economics. And if you are running multi-session agents where context management compounds across sessions, the post on long-running agent harnesses covers how to structure that.

The thing that clicked for me when I finally understood this is that prompt caching is not an optimization you apply at the end. It is a constraint that shapes the design from the start. The layered structure is not arbitrary. Each layer exists because it has a different rate of change, and matching the cache strategy to that rate is what makes the economics work.

For multi-agent setups where several agents share a common static prefix, the cache is effectively shared infrastructure. One agent warming the cache benefits all the others. That compound effect is why teams building multi-agent systems at scale care about this so much. The per-session savings multiply across every agent running concurrently.

The next post in this series gets into cross-session cache warming strategies and what to do when the 5-minute TTL is a constraint. But the foundation is what is here: static prefix, append-only messages, four breakpoints placed at the right layer boundaries, and the compaction fork when the context fills.

Get the structure right first. The rest is tuning.


FAQ#

How does automatic caching differ from explicit cache_control breakpoints?

Explicit breakpoints let you mark exactly where you want cache boundaries, up to four per request. You place a cache_control block at the end of the content you want cached, and the API stores everything from the start of the request to that point as a single cache entry.

Automatic caching, which Anthropic introduced for longer prompts, works without any explicit annotation. If a request exceeds a token threshold (currently around 1,024 tokens for most models), the API will automatically attempt to cache the longest matching prefix from your previous requests. It is less precise than explicit breakpoints but requires no changes to existing code. For most production agents, explicit breakpoints give you more control over where the cache boundaries land, which matters when you have a known static prefix you want to protect.

Why are tools serialized before the system prompt in Anthropic's internal ordering?

The tools-first ordering means a cache_control on the last tool definition creates a cache entry that includes only tools. A cache_control on the system prompt then creates a larger entry that covers tools plus system. This nesting means you can cache the tool definitions independently of the system prompt. If you have a shared tool registry across multiple agents with different system prompts, they can all share the same tool cache entry. It is a design that allows mix-and-match at the layer boundaries.

How does the compaction fork handle the case where tool definitions change between sessions?

This is the scenario where the answer is: treat it as a cold start. If tool definitions genuinely need to change between sessions, the old cache entries are invalid and you will pay to write new ones. The compaction fork pattern is specifically for mid-session context management, where the prefix stays constant and only the message tail is being replaced. If you are doing a new session with new tools, you are starting fresh. The cache warms back up over the first few turns of the new session.

If I am using a managed framework like LangChain, does the serialization order still matter?

Yes, because the cache key is computed from the raw bytes of the request content, not from your framework's logical representation. If LangChain (or any other framework) serializes tool schemas with non-deterministic key ordering, two requests with logically identical tools will produce different cache keys. Check what your framework actually sends on the wire. The easiest way to verify is to log the raw API request body and check whether it is identical across two calls with the same inputs. If it is not, the cache will never hit regardless of your breakpoint placement.

What is the practical difference between putting dynamic context in the system prompt versus in messages?

The system prompt is part of the cached prefix. Anything you put there becomes part of the cache key. If you update the system prompt with the current time on every turn, the cache key changes on every turn, and you get zero cache hits. Messages are outside the prefix cache boundary. Adding a new message does not invalidate any previously cached content. It just extends the tail. Dynamic context like timestamps, file state, and environment variables always belongs in messages. The system prompt should contain nothing that changes more often than once per session, ideally nothing that changes more often than once per project.

Does the four-breakpoint limit apply per request or per session?

Per request. You get four explicit cache_control markers per API call. Those four slots reset with each new request. The practical effect for a session is that you are choosing, on every request, which four points in the prompt to mark as cache boundaries. In a well-structured prompt, the answer is always the same four points: after tools, after system, after project context, and after the last user message. Because the answer is always the same, the same cache entries get hit on every turn, which is exactly what you want.
