
Prompt caching is not optional anymore. Measuring a 47% cost drop

A walkthrough from a client engagement: identifying stable prefixes, restructuring the system prompt for cacheability, and the telemetry that proved caching was actually working.

Jigar Joshi, Agentic AI Architect and Consultant

Numbers from a recent engagement: 47% reduction in monthly model spend, no change in output quality, two days of engineering work. Prompt caching is not optional once your traffic stabilises. It is the same theme as "the cheapest LLM call is the one you do not make", applied to the calls you cannot avoid: make the tokens you resend cost a fraction of full price.

How caching actually saves money

A cache hit means the provider already has the prefix of your prompt in a fast store and bills it at a steep discount instead of full input price. The catch is that caching keys on an exact, contiguous prefix. The moment something dynamic appears in that prefix, everything after it is uncacheable. So the whole game is arranging your prompt so the stable part comes first and stays byte-for-byte identical across requests.
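To make the discount concrete, here is a minimal sketch of the input-cost arithmetic for one request. The price, the 10% cached rate, and the token counts are illustrative assumptions, not figures from the engagement or from any provider's price list, and the sketch ignores any cache-write surcharge or cache expiry a provider may apply.

```python
def request_input_cost(prefix_tokens, dynamic_tokens, price_per_mtok,
                       cached_discount, cache_hit):
    """Input cost in dollars for a single request.

    On a hit, the stable prefix is billed at the discounted rate;
    the dynamic tail is always billed at full price.
    """
    full_rate = price_per_mtok / 1_000_000
    prefix_rate = full_rate * cached_discount if cache_hit else full_rate
    return prefix_tokens * prefix_rate + dynamic_tokens * full_rate


# All values below are assumptions chosen for illustration.
PRICE_PER_MTOK = 3.00    # dollars per million input tokens
CACHED_DISCOUNT = 0.10   # cached tokens billed at 10% of full price
PREFIX_TOKENS = 8_000    # tools, instructions, reference docs
DYNAMIC_TOKENS = 500     # per-user context plus the user message

miss = request_input_cost(PREFIX_TOKENS, DYNAMIC_TOKENS,
                          PRICE_PER_MTOK, CACHED_DISCOUNT, cache_hit=False)
hit = request_input_cost(PREFIX_TOKENS, DYNAMIC_TOKENS,
                         PRICE_PER_MTOK, CACHED_DISCOUNT, cache_hit=True)
saving = 1 - hit / miss
print(f"miss=${miss:.4f} hit=${hit:.4f} saving={saving:.0%}")
```

Under these assumed numbers a hit cuts the input cost of the request by roughly 85%. The blended saving across real traffic is lower because output tokens and cache misses are still billed in full.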

Step 1: find your stable prefixes

Audit your system prompts. Anything that does not change per request is a cache candidate: tool definitions, role and style instructions, retrieval indexes for closed corpora, few-shot examples. The user message and any per-request context go after the cacheable section, never threaded through it.
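One rough way to start the audit is to take a sample of logged prompts for a route and measure how much of them is literally identical from the first character. The sketch below assumes you can export prompts as plain strings; the sample prompts are made up. It compares characters while providers match on tokens, so treat the result as an approximation.

```python
import os


def stable_prefix(prompts):
    """Longest character prefix shared by every prompt in the sample."""
    return os.path.commonprefix(prompts)


# Hypothetical logged prompts for one route.
logged = [
    "You are a support agent.\nTools: search, refund\n"
    "Today: 2026-01-05\nUser: hi",
    "You are a support agent.\nTools: search, refund\n"
    "Today: 2026-01-06\nUser: where is my order",
]

prefix = stable_prefix(logged)
shortest = min(len(p) for p in logged)
print(f"shared prefix: {len(prefix)} of {shortest} characters")
print("prefix ends with:", repr(prefix[-20:]))
```

If the shared prefix ends much earlier than expected, the text right after it (here, the date) is the dynamic value that needs to move behind the cacheable section.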

Step 2: restructure for cache hits

Move all stable content into one contiguous block at the start of the prompt and place the cache breakpoint after that block. The mistake I see most often is interleaving cacheable and non-cacheable content, for example dropping a timestamp or a per-user greeting into the middle of the system prompt, which forfeits the cache benefit even though "caching is enabled" in the dashboard.

What belongs before and after the cache breakpoint

| Content | Stable across requests? | Placement |
| --- | --- | --- |
| Tool definitions | Yes | Before (cached) |
| Role and style instructions | Yes | Before (cached) |
| Closed-corpus reference docs | Yes | Before (cached) |
| Per-user context, timestamps | No | After the breakpoint |
| The user message | No | After the breakpoint |
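The placement rule can be sketched as a request builder that always emits the stable blocks first, marks the last of them as the breakpoint, and appends per-request content afterwards. The block shape and the `cache_breakpoint` field are hypothetical and provider-neutral; each provider has its own way of marking (or inferring) the cached prefix, so map this onto the real API you use.

```python
import hashlib
import json

# Hypothetical stable content; in practice this is loaded once at startup.
STABLE_BLOCKS = [
    {"kind": "tools", "text": "search(query), refund(order_id)"},
    {"kind": "instructions", "text": "You are a concise support agent."},
    {"kind": "reference", "text": "Refund policy: 30 days with receipt."},
]


def build_request(user_message, per_user_context):
    """Stable blocks first, breakpoint on the last one, dynamic content after."""
    prefix = [dict(block) for block in STABLE_BLOCKS]  # copy, do not mutate
    prefix[-1]["cache_breakpoint"] = True  # hypothetical marker field
    dynamic = [
        {"kind": "context", "text": per_user_context},
        {"kind": "user", "text": user_message},
    ]
    return {"prefix": prefix, "dynamic": dynamic}


def prefix_fingerprint(request):
    """Hash of the serialized prefix; it should never vary between requests."""
    payload = json.dumps(request["prefix"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = build_request("Where is my order?", "user=ana, now=2026-01-05T10:00Z")
b = build_request("Can I get a refund?", "user=raj, now=2026-01-06T14:30Z")
print(prefix_fingerprint(a) == prefix_fingerprint(b))
```

Logging the fingerprint alongside each request is a cheap guard: if it changes between two requests on the same route, something dynamic has leaked into the cached block.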

Step 3: instrument it

You need cache-hit telemetry, not just usage telemetry. Most providers return cache-read counts in response metadata. Log them. If you cannot see your cache-hit rate per route, you cannot optimise it, and you cannot tell the difference between caching that works and caching that is silently defeated by an interleaved dynamic token. Aim for 70%+ on the routes where you have stable prefixes; this is one of the core panels in the agent observability stack we ship.
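As a sketch of that telemetry, the snippet below aggregates logged usage records into a per-route hit rate, defined here as cached input tokens divided by total input tokens. The record shape and the field names `route`, `input_tokens`, and `cached_tokens` are assumptions; providers name these fields differently, and some report cached tokens separately from the input count, so check how yours defines them before summing.

```python
from collections import defaultdict

TARGET_HIT_RATE = 0.70  # the 70% target mentioned above


def cache_hit_rate_by_route(records):
    """Return {route: cached_tokens / input_tokens} over all records."""
    totals = defaultdict(lambda: [0, 0])  # route -> [cached, input]
    for record in records:
        bucket = totals[record["route"]]
        bucket[0] += record["cached_tokens"]
        bucket[1] += record["input_tokens"]
    return {
        route: (cached / total if total else 0.0)
        for route, (cached, total) in totals.items()
    }


# Hypothetical log records, one per model call.
records = [
    {"route": "/chat", "input_tokens": 1000, "cached_tokens": 800},
    {"route": "/chat", "input_tokens": 1000, "cached_tokens": 700},
    {"route": "/summarise", "input_tokens": 500, "cached_tokens": 0},
]

rates = cache_hit_rate_by_route(records)
below_target = sorted(r for r, rate in rates.items() if rate < TARGET_HIT_RATE)
print(rates)
print("routes below target:", below_target)
```

Here `/chat` comes out at 0.75 and `/summarise` at 0.0, so the second route is the one to inspect for an interleaved dynamic token.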

Common mistakes

  • Enabling caching but interleaving a timestamp or per-user string into the stable block, which silently forfeits every hit after it.
  • Measuring total token usage and assuming caching works, without ever logging cache-read counts.
  • Caching a prefix that changes more often than you think, such as a tool list rebuilt in a non-deterministic order each request.
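The third mistake usually has a mechanical fix: serialize the tool list canonically before it goes into the prompt. A minimal sketch, assuming tools are plain dicts with a `name` key (the tool definitions below are made up):

```python
import json


def canonical_tools(tools):
    """Serialize tools so the same set always yields the same bytes."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    # sort_keys fixes key order inside each tool; separators fix whitespace.
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


# Hypothetical tool definitions, built in a different order per request.
search = {"name": "search", "params": {"query": "string"}}
refund = {"params": {"order_id": "string"}, "name": "refund"}

first = canonical_tools([search, refund])
second = canonical_tools([refund, search])
print(first == second)
```

Sorting by name is a design choice for the sketch; any total order works as long as it is the same on every request and every server.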

The boring engineering wins. Caching is not glamorous, but it is the single biggest ROI optimisation I do for clients in 2026, and it pairs naturally with the registry and pre-agentic work from the cost post above. If you want this done as a focused engagement, it is a two-day consulting project for most stabilised systems.
