AI Observability
You cannot fix what you cannot see.
Observability is the unsexy half of production AI that decides whether you keep your job. Per-step traces, score-bearing evals, cache-hit telemetry, tool-selection scores — these are the primitives that let you debug agents instead of guessing at them.
The four telemetry layers
The four layers are: input/output traces per step (Langfuse-shaped); eval scoring on representative inputs, not just happy paths; cache-hit rate per route, the new throughput metric; and tool-selection score deltas from the model boundary. Without all four, multi-agent debugging is folklore.
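A minimal sketch of how the four layers can sit on one record per step, assuming you export traces into plain Python objects. The `StepTrace` fields and both helper functions are hypothetical names for illustration, not the schema of Langfuse or any other tool, and "score delta" is read here as the gap between the top-scored tool and the runner-up.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StepTrace:
    """One agent step. Every field name here is illustrative."""
    agent: str
    route: str
    input_text: str                       # layer 1: input/output trace
    output_text: str
    cache_hit: bool                       # layer 3: cache-hit telemetry
    eval_score: Optional[float] = None    # layer 2: None if the step was not scored
    tool_scores: Dict[str, float] = field(default_factory=dict)  # layer 4


def cache_hit_rate_by_route(traces):
    """Return {route: fraction of steps on that route served from cache}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t in traces:
        totals[t.route] += 1
        hits[t.route] += int(t.cache_hit)
    return {route: hits[route] / totals[route] for route in totals}


def tool_selection_margin(trace):
    """Gap between the top-scored tool and the runner-up.

    Returns None when fewer than two candidate tools were scored.
    """
    ranked = sorted(trace.tool_scores.values(), reverse=True)
    if len(ranked) < 2:
        return None
    return ranked[0] - ranked[1]
```

A small margin on a step that later failed is a hint that the registry, not the prompt, deserves the first look.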
Eval datasets that surface real bugs
A 100-case eval set that scores 95% means nothing if all 100 cases are the happy path. The cases that matter are the ambiguous ones, the conflicting ones, the malformed ones. Build the eval set adversarially by asking "what would a real user actually paste in?" and most agents score 15–25 points lower than they did on the happy path.
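One way to make that gap visible is to tag every eval case with a category and report the pass rate per category next to the headline number. This is a sketch under assumptions: the category labels and the `(category, passed)` result shape are hypothetical, and the sample data is made up purely to show the arithmetic.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """results: iterable of (category, passed) pairs. Returns {category: pass rate}."""
    passed, totals = defaultdict(int), defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {c: passed[c] / totals[c] for c in totals}


def overall_pass_rate(results):
    """Single headline score across all cases; 0.0 for an empty result set."""
    results = list(results)
    if not results:
        return 0.0
    return sum(int(ok) for _, ok in results) / len(results)


# Illustrative data only: 10 happy-path cases, 5 ambiguous, 5 malformed.
sample = (
    [("happy_path", True)] * 10
    + [("ambiguous", True)] * 3 + [("ambiguous", False)] * 2
    + [("malformed", True)] * 2 + [("malformed", False)] * 3
)

print(overall_pass_rate(sample))       # 0.75
print(pass_rate_by_category(sample))   # happy_path 1.0, ambiguous 0.6, malformed 0.4
```

The headline number hides where the failures cluster; the per-category view is the one worth putting on a dashboard.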
Per-agent cost attribution
Tools like Langfuse now expose per-agent cost attribution and step-level cache-hit telemetry. You will discover 30–50% of your agent traffic is repeat queries you should be caching. This is the report nobody wants to read but everyone needs to.
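If your tooling does not expose these numbers directly, a rough version can be computed from an exported call log. The sketch below assumes each call is a dict with `agent`, `query`, and `cost_usd` keys; that shape is hypothetical, not a Langfuse export format, and the normalization (lowercase plus collapsed whitespace) is a deliberately crude stand-in for whatever cache key you would actually use.

```python
import hashlib
from collections import Counter, defaultdict


def cost_by_agent(calls):
    """Sum 'cost_usd' per 'agent' over a list of call dicts (hypothetical shape)."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["agent"]] += call["cost_usd"]
    return dict(totals)


def query_key(text):
    """Collapse case and whitespace so trivially different repeats share a key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def repeat_query_share(calls):
    """Fraction of calls whose normalized query already appeared in the log."""
    counts = Counter(query_key(call["query"]) for call in calls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeats = sum(n - 1 for n in counts.values())
    return repeats / total
```

Run `repeat_query_share` per route as well as globally; a high share on a single route is usually the cheapest caching win.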
Deep dives on AI Observability
Claude Code Artifacts turn terminal output into live review pages: what Team and Enterprise buyers should pilot first
Artifacts in Claude Code beta publish self-contained HTML to claude.ai that republishes to the same URL as the session progresses, with version history and org-only sharing. Strict CSP, no external fetch, no backend. Requires Team or Enterprise and claude.ai login. Here is the workflow I use for PR walkthroughs and incident timelines without screenshot threads in Slack.
Agentjacking is real: poisoned Sentry errors can hijack Cursor, Claude Code, and Codex without touching your repo
Tenet Threat Labs injected a fake stack trace through a public Sentry DSN and watched 100+ coding agents execute attacker commands during normal triage. No git write access required. The agent treats the error as ground truth. Here is how I harden observability MCP feeds, scope triage prompts, and block auto-exec on untrusted telemetry.
The June 15 Claude billing change: Agent SDK credits, model retirement, and the checklist I run before anything breaks
Two Anthropic changes land on the same day: programmatic Claude usage moves to a separate monthly credit pool, and claude-opus-4-20250514 plus claude-sonnet-4-20250514 stop answering on the API. Interactive Claude Code is fine. Cron jobs and CI agents are not. Here is how I audit auth paths, claim credits, and grep for retiring model IDs before the first failed run.
Governing agent autonomy in 2026: Auto-review, pre-push review, and why approval prompts are not a security model
Cursor made Auto-review the default run mode and shipped /review so Bugbot runs before you push. Together they treat agent autonomy as a dial: low-stakes actions flow, high-stakes actions slow down. Here is how I wire that pattern into local agents, SDK headless runs, and CI without mistaking convenience for a hard security boundary.
Agentic RAG vs vanilla RAG: why a Sufficient Context Agent beats retrieve-then-pray
Google Research shipped Agentic RAG on Gemini Enterprise with a Sufficient Context Agent that refuses to answer when retrieval is incomplete. On factuality benchmarks they report up to 34% higher accuracy versus standard RAG. Here is when one-shot RAG is still enough, when you need iterative retrieval, and how I wire the pattern without blowing latency budgets.
Agentic transformation is an operating-model problem, not a model problem
Microsoft published a 6-step playbook for rolling agents out across an enterprise, and the line that matters is "you do not need a bigger model, you need a better operating model." That matches what I see in consulting: the pilots that die do not die on model quality, they die on ownership, evals, and governance. Here is how I read the playbook for IT services teams, and the operating-model gaps that actually stall agent rollouts.
Your coding agent has amnesia. Persistent memory is the fix.
Claude Code forgets your architecture, your decisions, and why you ruled things out the moment a session ends. The reliability tax is not tokens, it is re-establishing context every morning. Here is what persistent agent memory actually is, how an open-source engine like Cortex implements it, and how to evaluate a memory layer for your own agents.
Your agent's supply chain is the attack surface now
A poisoned VS Code extension spent eighteen minutes on the marketplace and walked off with Claude Code credentials and MCP configs. The model was never the target. Your agent's supply chain is: the extensions, skills, MCP servers, tool definitions, and keys it is allowed to touch. Here is how I harden all four layers, and the checklist I run on every deployment.
How an agentic studio screens, scores and shortlists candidates for your hiring team
Open Recruiting Atelier and you do not see a generic AI dashboard. You see five named specialists doing the work a screening team would do: catching duplicates, checking the brief, scoring on four dimensions, ranking, drafting the dispatch. Drop one CV or fifty. Click any candidate to see exactly why they landed where they did. This is what AI for recruitment looks like when it respects your judgment instead of replacing it.
Tool registry design for agentic AI: how the wrong registry kills accuracy before the prompt is read
I reviewed a system last month with 47 tools in its registry and a 22 percent wrong-tool-selection rate. The team was about to migrate from Sonnet to Opus to fix it. The prompt was fine. The registry was the bug. This is the audit pattern I run on every client codebase before we change anything else, the seven failure modes I see in production, and the numbers from the cleanup.
AI agent vs agentic AI: what the distinction actually means when you ship one
Vendors blur the line because "agentic" sells. The two terms describe different architectures, with different cost shapes, different observability needs, and different scoping conversations. Here is the framing I use with clients and the three-question test for which one your project actually needs.
MCP governance just became a product: what Databricks Unity AI Gateway changes for enterprise agents
Every enterprise MCP deployment I have audited in the last six months has been hand-rolling tool-access policy, payload logging, and per-team cost limits on top of a gateway someone wrote in two days. Databricks just shipped that as a product. Here is what it actually changes, where the gaps still are, and the migration I would run for a Databricks shop.
Tool descriptions are prompts. Fix the registry, not the agent.
When an agent picks the wrong tool, the registry is broken, not the agent. Three rules I now apply before debugging anything in a multi-tool system: precise names, "when to use" triggers, and a curated load list. Anthropic's new tool-selection telemetry finally puts numbers on what changes accuracy.
The cheapest LLM call is the one you do not make. GitHub's 19-62% token cut, decoded
GitHub published an instrumented analysis of their agentic CI workflows and reported 19-62% token-cost reductions. The savings are the headline. The technique (pre-agentic data fetching and tool-registry hygiene) is the story most teams will miss.
Claude Opus 4.7's 1M context: when to RAG and when to just stuff it
A reliable million-token context is real now, but it does not retire RAG. It changes the calculus: cost, latency, recency, and the prompt-cache angle nobody is talking about.
Prompt caching is not optional anymore. Measuring a 47% cost drop
A walkthrough from a client engagement: identifying stable prefixes, restructuring the system prompt for cacheability, and the telemetry that proved caching was actually working.
The agent observability stack we ship to every client
Traces, spans, evals, cost-per-completed-task, and the one dashboard panel that catches 80% of regressions. Vendor-agnostic; covers Langfuse, Honeycomb, and rolling your own.
Eval datasets: stop testing your agents on the happy path
If your eval set is the demos you showed the client, you are testing the wrong thing. How we build evals from production failures and the minimum viable suite to ship.
RAG vs CAG: how to actually decide
A decision framework from real implementations. RAG retrieves. CAG stores in cache. Knowing which to use, and when to combine both, determines whether your agent finds the right answer at the right cost.
Latest in AI Observability
Xiaomi MiMo V2-Flash and TTS endpoints auto-route to MiMo-V2.5 on June 18: legacy model IDs retire June 30
Tenet demonstrates Agentjacking: a poisoned Sentry error report hijacks Cursor, Claude Code, and Codex into running attacker code with no repo compromise
Zhipu ships GLM-5.2: MIT open weights, 1M context, and Anthropic-compatible API for long-horizon coding agents
IIT Bombay unveils BharatGen Param2: a 17B MoE with tool calling across all 22 scheduled Indian languages, plus Shrutam2 ASR and Patram document vision
Claude Opus 4 and Sonnet 4 retire on the API today: requests to claude-opus-4-20250514 and claude-sonnet-4-20250514 now fail
MiniMax M3 open weights ship on Hugging Face: 428B MoE with 1M sparse-attention context, native multimodality, and computer use
Google ships Agentic RAG on Gemini Enterprise with a Sufficient Context Agent that stops when retrieval is incomplete
Alibaba ships Qwen3.7-Plus as a hybrid GUI-and-CLI agent: native screen grounding, 1M context, and Anthropic-compatible API endpoints
How AI Observability ships in our engagements
The pages below are the buyer-focused, conversion-grade versions of this topic — deliverables, methodology, ROI, security considerations, and CTAs to scope a real engagement.
Agentic AI Consulting
Designed, built, and handed off: production agentic systems for enterprise teams.
AI Guardrails
Multi-layer safety, policy, and audit controls for agents in regulated environments.
AI Systems Engineering Training
Eight-day corporate training programs that take dev teams from AI-assisted coding to production agentic systems.
Enterprise AI Architecture
Reference architectures for organisations standing up an AI platform: not one agent, but the foundation for many.
AI Observability
Tracing, eval, cache-hit telemetry, and cost attribution for production agents.
Multi-Agent Workflows
Supervisor + handoff orchestration for portfolios of agents that need to cooperate without arguing.
Train your team on AI Observability
Two tracks: one for developers who build agents, one for business teams who use them. Customised to your stack, hands-on from session 1.
Ship your first AI Observability system
Architecture design, production implementation on Claude API and MCP, full observability, and a real handoff. Working agents, not slides.

