Why do production agents degrade silently?

Because without instrumentation there is no signal when a tool starts failing, a prompt change lowers first-pass accuracy, or a model update shifts behaviour. The agent keeps returning answers, just worse ones, until a customer notices. Observability converts that invisible decay into a metric you can watch.

Why track cache-hit rate separately from usage?

Because healthy-looking token usage can hide a cache that has silently stopped working. A dedicated cache-hit panel makes a broken prefix obvious instead of letting it disappear into aggregate usage numbers.

What does "cost per completed task" mean?

Cost attributed per route divided by how many requests actually reached a terminal state. It captures retries and rework that cost-per-token hides, and it is the number you should optimise because it reflects real outcomes.

Which observability tool should I use?

Whichever your team will actually look at. Langfuse for a managed agent-focused product, Honeycomb if you already run it for services, or OpenTelemetry directly to keep options open. The discipline of reviewing the dashboards matters more than the specific vendor.

The Agent Observability Stack We Ship to Clients

Q: What is the single most useful observability panel?

The step-count distribution, a histogram of how many steps each request took. It is a leading indicator: the agent starts taking more steps to reach the same answer before accuracy or cost visibly move, so a rightward shift or a growing long tail flags trouble early.

In this post (4 sections)

In this post

Every client engagement we run ships an observability layer with the agent. Skipping this is the single most common reason I see production agents quietly degrade over time. Nobody can see them degrade, so a tool that started failing last Tuesday keeps failing until a customer complains. Observability is what turns that silent decay into a dashboard line you can act on.

The stack

Per-step traces with input, output, model, tool calls, and timing, so any single request can be replayed.
Cost attribution per request and per route, the foundation for the savings work in the cost posts.
A cache-hit rate panel, kept separate from usage metrics so a broken cache cannot hide behind healthy token counts.
An eval suite that runs on every deploy plus a nightly canary, so regressions surface before and after release.
A completion-rate gauge: of N user requests today, what fraction reached a terminal state?

The one panel that catches everything

Step-count distribution. A histogram of how many agent steps each request took. When this distribution shifts right or grows a long tail, something is wrong. Usually a tool starting to fail, a prompt change degrading first-pass accuracy, or a model update that needs re-tuning. I have caught more regressions on this single panel than every other piece of telemetry combined.

It works because step count is a leading indicator. Before accuracy visibly drops, before costs spike, the agent starts taking more steps to reach the same answer, retrying tools and re-planning. The long tail of that histogram is where the agents that fail after three steps live, and watching the distribution lets you find them as a population rather than one support ticket at a time.

Five signals and what each one catches

Signal	Catches	Why it earns its place
Per-step traces	Where a single request went wrong	Replay beats guessing
Cost per route	Where the spend concentrates	Targets optimisation work
Cache-hit panel	Silently broken caching	Usage metrics hide it
Eval suite + canary	Regressions at and after deploy	Pre and post release coverage
Step-count histogram	Most early degradation	Leading indicator of trouble

Measure cost per completed task, not per token

Cost attribution is only useful if you divide by outcomes. A route can look cheap per token while being expensive per completed task because it retries constantly. Pairing cost-per-route with the completion-rate gauge gives you cost per completed task, the metric I argue for in Haiku 4.5 made our router 5x cheaper and the cheapest LLM call is the one you do not make. The cache-hit panel ties straight back to prompt caching is not optional anymore.

Vendor-agnostic

Langfuse if you want a managed product. Honeycomb if you already have it for your services. OpenTelemetry directly if you want to roll your own and keep options open. The stack matters less than the discipline of looking at it. Pick the tool your team will actually open every morning.

Whatever you choose, ship it on day one rather than bolting it on after the first incident. Standing this up is part of every consulting engagement I run and a module in our training, because an agent you cannot observe is an agent you cannot keep healthy.

The agent observability stack we ship to every client

The stack

The one panel that catches everything

Measure cost per completed task, not per token

Vendor-agnostic

Agentic AI patterns, delivered Thursdays

Questions readers ask about this post

Read next

The agent observability stack we ship to every client

The stack

The one panel that catches everything

Measure cost per completed task, not per token

Vendor-agnostic

Agentic AI patterns, delivered Thursdays

Questions readers ask about this post

Read next

Claude Code Artifacts turn terminal output into live review pages: what Team and Enterprise buyers should pilot first

Agentjacking is real: poisoned Sentry errors can hijack Cursor, Claude Code, and Codex without touching your repo

The June 15 Claude billing change: Agent SDK credits, model retirement, and the checklist I run before anything breaks