RAG vs CAG: how to actually decide
A decision framework from real implementations. RAG retrieves. CAG caches. Knowing which to use, and when to combine both, determines whether your agent finds the right answer at the right cost.
RAG retrieves at request time. CAG (cache-augmented, or context-augmented, generation) stores frequently needed content in the prompt cache so it is reused across calls. They solve overlapping problems at different costs, and the choice is not ideological: it is a function of how your corpus behaves. This is the framework version of the question I raised in "Opus 4.7's 1M context: RAG or just stuff it".
When to RAG
- The corpus is too large to fit in context, or it grows unboundedly.
- Content changes frequently, such as product catalogues or ticket queues.
- Per-user access controls apply, where different users must see different subsets.
When to CAG
- Stable reference material everyone needs, such as style guides, schemas, or framework docs.
- A high repeat-query rate against the same corpus, so the cached prefix is reused constantly.
- Latency-sensitive paths where a retrieval round-trip costs more than reading the cached prefix.
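The two checklists above can be read as a per-slice decision function. What follows is a minimal sketch, not a definitive implementation: the `ContentSlice` fields, the `choose_strategy` name, and the 200,000-token budget are all hypothetical, chosen only to mirror the bullets.

```python
from dataclasses import dataclass


@dataclass
class ContentSlice:
    """One slice of the corpus; every field here is an illustrative assumption."""
    name: str
    tokens: int               # approximate size of the slice in tokens
    grows_unbounded: bool     # e.g. a ticket queue that never stops growing
    changes_frequently: bool  # e.g. a product catalogue
    per_user_access: bool     # different users must see different subsets


def choose_strategy(s: ContentSlice, context_budget_tokens: int) -> str:
    """Return "RAG" if any of the RAG conditions holds, otherwise "CAG"."""
    needs_rag = (
        s.tokens > context_budget_tokens
        or s.grows_unbounded
        or s.changes_frequently
        or s.per_user_access
    )
    return "RAG" if needs_rag else "CAG"


slices = [
    ContentSlice("style_guide", 8_000, False, False, False),
    ContentSlice("ticket_queue", 40_000, True, True, False),
    ContentSlice("customer_contracts", 20_000, False, False, True),
]
decisions = {s.name: choose_strategy(s, context_budget_tokens=200_000) for s in slices}
print(decisions)
```

The sketch deliberately ignores repeat-query rate and latency, which strengthen the case for CAG but do not override the RAG conditions; per-user content in particular stays on the retrieval side regardless of how stable it is.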
When to combine
Most production systems end up doing both: caching the stable reference material and retrieving the volatile or per-user content. The practical rule is to cache anything that does not change within 24 hours and retrieve everything else. Then measure cache-hit rate, which tells you whether you got the split right; the instrumentation for that is covered in "prompt caching is not optional anymore" and in the agent observability stack we ship.
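As a rough illustration of the 24-hour rule and the hit-rate check, here is a sketch under stated assumptions. The `last_changed` mapping, the per-call `cached_tokens` and `uncached_tokens` fields, and the token-weighted definition of hit rate are hypothetical; substitute whatever your provider or logging layer actually reports.

```python
from datetime import datetime, timedelta, timezone

STABLE_AFTER = timedelta(hours=24)  # the 24-hour rule from the text


def split_by_age(last_changed: dict[str, datetime], now: datetime) -> tuple[list[str], list[str]]:
    """Return (cache, retrieve): slices unchanged for 24h or more go to cache."""
    cache, retrieve = [], []
    for name, changed_at in sorted(last_changed.items()):
        if now - changed_at >= STABLE_AFTER:
            cache.append(name)
        else:
            retrieve.append(name)
    return cache, retrieve


def cache_hit_rate(calls: list[dict[str, int]]) -> float:
    """Token-weighted hit rate: cached input tokens over all input tokens.

    This definition is an assumption; a per-request hit/miss ratio is another option.
    """
    cached = sum(c["cached_tokens"] for c in calls)
    total = sum(c["cached_tokens"] + c["uncached_tokens"] for c in calls)
    return cached / total if total else 0.0


now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)  # fixed time for a repeatable example
last_changed = {
    "schema_docs": now - timedelta(days=30),
    "product_catalogue": now - timedelta(hours=2),
}
cache_these, retrieve_these = split_by_age(last_changed, now)

calls = [
    {"cached_tokens": 9_000, "uncached_tokens": 1_000},  # warm call, prefix reused
    {"cached_tokens": 0, "uncached_tokens": 10_000},     # cold call, prefix rebuilt
]
rate = cache_hit_rate(calls)
print(cache_these, retrieve_these, rate)
```

A persistently low rate on content you classified as stable suggests the split is wrong, for example because something volatile has crept into the cached prefix.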
Common mistakes
- Caching per-user content into a shared prefix, which is both a cache-pollution problem and a data-leak risk.
- Retrieving stable reference material on every call, paying a round-trip for content that never changes.
- Choosing one approach for the whole system instead of splitting by how each slice of content behaves.
- Picking a split and never measuring cache-hit rate to confirm it.
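The first mistake in the list is the easiest to guard against mechanically. One possible approach, sketched here with hypothetical names (`Slice`, `build_shared_prefix`) and assuming each slice is already tagged as shared or per-user, is to refuse to build a shared prefix that contains per-user content:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    """A piece of prompt content; the per_user tag is an assumed convention."""
    name: str
    text: str
    per_user: bool


def build_shared_prefix(slices: list[Slice]) -> str:
    """Join shared slices into one cacheable prefix; fail loudly on per-user content."""
    leaked = [s.name for s in slices if s.per_user]
    if leaked:
        raise ValueError(f"per-user content in shared prefix: {leaked}")
    return "\n\n".join(s.text for s in slices)


shared = [
    Slice("style_guide", "Use sentence case.", per_user=False),
    Slice("schema", "orders(id, total)", per_user=False),
]
prefix = build_shared_prefix(shared)
print(prefix)
```

Raising instead of silently filtering is a design choice: a dropped slice would hide the misclassification, while an error surfaces it before the content reaches a cache that other users read from.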
RAG and CAG are not rivals; they are two tools for two kinds of content. Get the split right and your agent finds the right answer at the right cost; get it wrong and you pay in latency, spend, or stale answers. Designing that split for a specific workload is a common consulting starting point.