How much can prompt caching actually save?

On one engagement it cut monthly model spend by 47% with no quality change and two days of work. The exact figure depends on how much of your prompt is stable and how high your repeat-traffic rate is, but any stabilised system with a large system prompt has meaningful savings available.

Why is the order of my prompt so important for caching?

Caching keys on an exact, contiguous prefix. As soon as a dynamic value appears, everything after it becomes uncacheable. Putting all stable content first, with the cache breakpoint after it, is what lets the provider reuse the prefix across requests.

What commonly breaks caching even when it is "enabled"?

Interleaving dynamic content into the stable block, such as a timestamp, a per-user greeting, or a tool list assembled in a non-deterministic order. Any of these changes the prefix and defeats the cache from that point on.

What cache-hit rate should I aim for?

Target 70%+ on routes that have genuinely stable prefixes. Some stable-system-prompt workloads reach the low 80s. The exact target matters less than tracking it per route so you can see regressions.

How do I prove caching is working?

Log the cache-read counts that providers return in response metadata, broken out per route. Usage metrics alone cannot tell you whether tokens were billed at full price or at the cached discount.

Prompt Caching: Measuring a 47% Cost Drop

Numbers from a recent engagement: a 47% reduction in monthly model spend, no change in output quality, two days of engineering work. Prompt caching is not optional once your traffic stabilises. It is the same theme as "the cheapest LLM call is the one you do not make", applied to the calls you cannot avoid: make the tokens you resend cost a fraction of full price.

How caching actually saves money

A cache hit means the provider already has the prefix of your prompt in a fast store and bills it at a steep discount instead of full input price. The catch is that caching keys on an exact, contiguous prefix. The moment something dynamic appears in that prefix, everything after it is uncacheable. So the whole game is arranging your prompt so the stable part comes first and stays byte-for-byte identical across requests.
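
To make the discount concrete, here is a rough sketch of the arithmetic. Every number in it (the price per million tokens, the cached-token multiplier, the token counts, the hit rate) is invented for illustration and is not taken from the engagement or from any provider's price list. It also ignores output tokens and any cache-write surcharge a provider may apply.

```python
def blended_input_cost(total_tokens, stable_tokens, hit_rate,
                       price_per_mtok, cached_multiplier):
    """Expected input cost of one request, in currency units.

    stable_tokens is the cacheable prefix; the rest of the prompt is
    always billed at full price. All parameters are hypothetical.
    """
    dynamic_tokens = total_tokens - stable_tokens
    # On a hit the prefix is billed at the discounted multiplier,
    # on a miss it is billed at full price.
    effective_stable = stable_tokens * (
        hit_rate * cached_multiplier + (1 - hit_rate)
    )
    return (dynamic_tokens + effective_stable) * price_per_mtok / 1_000_000


# Hypothetical workload: 10k-token prompt, 8k of it stable.
baseline = blended_input_cost(10_000, 8_000, hit_rate=0.0,
                              price_per_mtok=3.0, cached_multiplier=0.1)
cached = blended_input_cost(10_000, 8_000, hit_rate=0.7,
                            price_per_mtok=3.0, cached_multiplier=0.1)
saving = 1 - cached / baseline
print(f"input-cost saving: {saving:.1%}")
```

Under these made-up inputs the input bill falls by about half. The two levers are visible in the formula: the share of the prompt that is stable, and the hit rate on that share.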

Step 1: find your stable prefixes

Audit your system prompts. Anything that does not change per request is a cache candidate: tool definitions, role and style instructions, retrieval indexes for closed corpora, few-shot examples. The user message and any per-request context go after the cacheable section, never threaded through it.
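
One way to start the audit is to take a sample of fully rendered prompts from your logs and measure how much of the beginning they share. The sketch below is a minimal version of that idea. It compares characters rather than tokens, so treat the result as an approximation; the sample prompts and the function name are hypothetical.

```python
import os


def shared_prefix(prompts):
    """Return the leading text that every prompt in the sample shares."""
    if not prompts:
        return ""
    # os.path.commonprefix compares strings character by character,
    # so it works on arbitrary text, not only on file paths.
    return os.path.commonprefix(list(prompts))


# Hypothetical rendered prompts for one route.
good = [
    "You are a support agent.\nTools: search, refund\nUser: where is my order?",
    "You are a support agent.\nTools: search, refund\nUser: cancel my plan",
]
# Same route, but a timestamp was rendered at the very top.
bad = [
    "Now: 10:00\nYou are a support agent.\nUser: where is my order?",
    "Now: 11:30\nYou are a support agent.\nUser: cancel my plan",
]

good_prefix = shared_prefix(good)
bad_prefix = shared_prefix(bad)
print(len(good_prefix), len(bad_prefix))
```

A short shared prefix on a route you believed was stable is the signal to look for: something dynamic is being rendered early in the prompt.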

Step 2: restructure for cache hits

Move all stable content into one contiguous block at the start of the prompt and place the cache breakpoint after that block. The mistake I see most often is interleaving cacheable and non-cacheable content, for example dropping a timestamp or a per-user greeting into the middle of the system prompt, which forfeits the cache benefit even though "caching is enabled" in the dashboard.

What belongs before and after the cache breakpoint

Content	Stable across requests?	Placement
Tool definitions	Yes	Before (cached)
Role and style instructions	Yes	Before (cached)
Closed-corpus reference docs	Yes	Before (cached)
Per-user context, timestamps	No	After the breakpoint
The user message	No	After the breakpoint
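
The placement in the table can be sketched as a small prompt builder. This is an illustration under assumptions, not a provider integration: the function and field names are hypothetical, and how the cache breakpoint is actually marked differs between providers, so the sketch only returns the two halves that sit on either side of it.

```python
import json


def build_prompt(tools, role_instructions, reference_docs,
                 per_user_context, user_message):
    """Return (stable_block, dynamic_block); the breakpoint goes between them."""
    # Sort tools and keys so the serialised block is byte-for-byte
    # identical no matter what order the tools were registered in.
    tools_block = json.dumps(
        sorted(tools, key=lambda tool: tool["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    stable_block = "\n\n".join([tools_block, role_instructions, reference_docs])
    # Anything that varies per request stays out of the stable block.
    dynamic_block = "\n\n".join([per_user_context, user_message])
    return stable_block, dynamic_block


tools_a = [{"name": "search", "description": "Search orders"},
           {"name": "refund", "description": "Issue a refund"}]
tools_b = list(reversed(tools_a))  # same tools, different assembly order

stable_1, dynamic_1 = build_prompt(
    tools_a, "You are a support agent.", "Refund policy: ...",
    "User: alice, time: 10:00", "Where is my order?")
stable_2, dynamic_2 = build_prompt(
    tools_b, "You are a support agent.", "Refund policy: ...",
    "User: bob, time: 11:30", "Cancel my plan")

print(stable_1 == stable_2)  # the cacheable half does not vary
```

Returning the two halves separately keeps the rule structural: there is no code path by which per-request data can end up inside the stable block.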

Step 3: instrument it

You need cache-hit telemetry, not just usage telemetry. Most providers return cache-read counts in response metadata. Log them. If you cannot see your cache-hit rate per route, you cannot optimise it, and you cannot tell the difference between caching that works and caching that is silently defeated by an interleaved dynamic token. Aim for 70%+ on the routes where you have stable prefixes; this is one of the core panels in the agent observability stack we ship.
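
A minimal sketch of per-route telemetry follows. It assumes you have already pulled two numbers out of each response's metadata: the total input tokens and the tokens served from cache. The field names differ between providers, and some report the cached count as part of the input total while others report it separately, so normalise before recording. The class, method names, and routes are hypothetical, and the hit rate here is token-weighted, which is one of several reasonable definitions.

```python
from collections import defaultdict


class CacheHitTracker:
    """Accumulates cached vs. total input tokens per route."""

    def __init__(self):
        self._totals = defaultdict(lambda: {"input": 0, "cached": 0})

    def record(self, route, total_input_tokens, cached_tokens):
        # total_input_tokens must already include the cached tokens.
        self._totals[route]["input"] += total_input_tokens
        self._totals[route]["cached"] += cached_tokens

    def hit_rate(self, route):
        totals = self._totals.get(route)
        if not totals or totals["input"] == 0:
            return 0.0
        return totals["cached"] / totals["input"]

    def routes_below(self, threshold=0.70):
        """Routes whose token-weighted hit rate is under the target."""
        return sorted(r for r in self._totals if self.hit_rate(r) < threshold)


tracker = CacheHitTracker()
tracker.record("/chat", total_input_tokens=1000, cached_tokens=800)
tracker.record("/chat", total_input_tokens=1000, cached_tokens=600)
tracker.record("/search", total_input_tokens=500, cached_tokens=100)

for route in ("/chat", "/search"):
    print(route, f"{tracker.hit_rate(route):.0%}")
print("below target:", tracker.routes_below())
```

In production these counters would feed whatever metrics system you already run; the point is only that the breakdown is per route, so one regressed route is not averaged away by the others.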

Common mistakes

Enabling caching but interleaving a timestamp or per-user string into the stable block, which silently forfeits every hit after it.
Measuring total token usage and assuming caching works, without ever logging cache-read counts.
Caching a prefix that changes more often than you think, such as a tool list rebuilt in a non-deterministic order each request.
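
The first and third mistakes can be caught before the bill arrives by fingerprinting the stable block on every request and flagging a route whenever the fingerprint changes. The sketch below is one hypothetical way to do that; the class name, route names, and the choice of a truncated SHA-256 are assumptions made for illustration.

```python
import hashlib


def prefix_fingerprint(stable_block):
    """Short, deterministic fingerprint of the cacheable prefix."""
    return hashlib.sha256(stable_block.encode("utf-8")).hexdigest()[:16]


class PrefixDriftDetector:
    """Remembers the last fingerprint seen per route."""

    def __init__(self):
        self._last_seen = {}

    def changed(self, route, stable_block):
        """True if this route's prefix differs from the previous request."""
        fingerprint = prefix_fingerprint(stable_block)
        previous = self._last_seen.get(route)
        self._last_seen[route] = fingerprint
        return previous is not None and previous != fingerprint


detector = PrefixDriftDetector()
first = detector.changed("/chat", "You are a support agent.")
second = detector.changed("/chat", "You are a support agent.")
# A timestamp leaked into the "stable" block on the third request.
third = detector.changed("/chat", "Now: 10:00\nYou are a support agent.")
print(first, second, third)
```

A prefix that legitimately changes on deploy will trip this once and then settle. A prefix that trips it on most requests is not stable, whatever the dashboard says.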

The boring engineering wins. Caching is not glamorous, but it is the single biggest ROI optimisation I do for clients in 2026, and it pairs naturally with the registry and pre-agentic work from the cost post above. If you want this done as a focused engagement, it is a two-day consulting project for most stabilised systems.
