The agent observability stack we ship to every client
Traces, spans, evals, cost-per-completed-task, and the one dashboard panel that catches 80% of regressions. Vendor-agnostic; covers Langfuse, Honeycomb, and rolling your own.
In this post (4 sections)
Every client engagement we run ships an observability layer with the agent. Skipping this is the single most common reason I see production agents quietly degrade over time. Nobody can see them degrade, so a tool that started failing last Tuesday keeps failing until a customer complains. Observability is what turns that silent decay into a dashboard line you can act on.
The stack
- Per-step traces with input, output, model, tool calls, and timing, so any single request can be replayed.
- Cost attribution per request and per route, the foundation for the savings work in the cost posts.
- A cache-hit rate panel, kept separate from usage metrics so a broken cache cannot hide behind healthy token counts.
- An eval suite that runs on every deploy plus a nightly canary, so regressions surface before and after release.
- A completion-rate gauge: of N user requests today, what fraction reached a terminal state?
The one panel that catches everything
Step-count distribution. A histogram of how many agent steps each request took. When this distribution shifts right or grows a long tail, something is wrong. Usually a tool starting to fail, a prompt change degrading first-pass accuracy, or a model update that needs re-tuning. I have caught more regressions on this single panel than every other piece of telemetry combined.
It works because step count is a leading indicator. Before accuracy visibly drops, before costs spike, the agent starts taking more steps to reach the same answer, retrying tools and re-planning. The long tail of that histogram is where the agents that fail after three steps live, and watching the distribution lets you find them as a population rather than one support ticket at a time.
Measure cost per completed task, not per token
Cost attribution is only useful if you divide by outcomes. A route can look cheap per token while being expensive per completed task because it retries constantly. Pairing cost-per-route with the completion-rate gauge gives you cost per completed task, the metric I argue for in Haiku 4.5 made our router 5x cheaper and the cheapest LLM call is the one you do not make. The cache-hit panel ties straight back to prompt caching is not optional anymore.
Vendor-agnostic
Langfuse if you want a managed product. Honeycomb if you already have it for your services. OpenTelemetry directly if you want to roll your own and keep options open. The stack matters less than the discipline of looking at it. Pick the tool your team will actually open every morning.
Whatever you choose, ship it on day one rather than bolting it on after the first incident. Standing this up is part of every consulting engagement I run and a module in our training, because an agent you cannot observe is an agent you cannot keep healthy.
Agentic AI patterns, delivered Thursdays
What I am shipping, watching, and pruning out of client stacks each week. One email. No fluff.