Tool descriptions are prompts. Fix the registry, not the agent.
When an agent picks the wrong tool, the registry is broken, not the agent. Three rules I now apply before debugging anything in a multi-tool system: precise names, "when to use" triggers, and a curated load list. Anthropic's new tool-selection telemetry finally puts numbers on what changes accuracy.
I watched an agent in a customer system pick the wrong tool three times in a row last week. The reflex on the team was to debug the model: log the prompts, retry with bigger context, ablate the system message. The model was fine. The tool registry was the problem. This is the most common misdiagnosis I see in production agent work, and it is the reason I wrote a companion piece on the three questions to ask when your agents are not broken, your tools are.
The registry is the spec the model reads at runtime
The mental model most engineers carry over from REST API design is wrong here. When you call a REST endpoint, you know which one to invoke because you read your own code. When a model picks a tool, it reads the description you wrote for it. Whatever you put in that string is the spec the model selects against. There is no other source of truth at runtime.
Say that back slowly, because it is the whole post. The function body does not influence selection. The variable names in your implementation do not influence selection. The Jira ticket that explains why the tool exists does not influence selection. Only the name, the description, and the parameter schema reach the model. If those three strings are vague, the model is guessing, and a guessing model looks exactly like a broken one from the outside.
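To make that concrete, here is a minimal sketch of what reaches the model versus what stays behind. The registry entry layout (name, description, input_schema) is an assumption modelled on common JSON-Schema-style tool definitions, and model_visible_payload is a hypothetical helper, not part of any SDK.

```python
import json


def get_customer_record(customer_id: str) -> dict:
    """Developer-facing docstring: this text is never sent to the model."""
    # Implementation details (queries, variable names) are invisible at selection time.
    return {"customer_id": customer_id}


# Hypothetical registry entry. These three fields are all the model selects against.
TOOL_SPEC = {
    "name": "get_customer_record",
    "description": (
        "Use this when the user asks about a CUSTOMER (account, profile, contact). "
        "Do not use for orders or invoices."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"customer_id": {"type": "string", "description": "Customer UUID"}},
        "required": ["customer_id"],
    },
    # Anything else you keep here (owner, ticket link, handler) stays on your side.
    "handler": get_customer_record,
}


def model_visible_payload(spec: dict) -> str:
    """Serialize only the fields that reach the model."""
    visible = {key: spec[key] for key in ("name", "description", "input_schema")}
    return json.dumps(visible, indent=2)


print(model_visible_payload(TOOL_SPEC))
```

Printing the payload for every tool in the registry is a cheap first audit: if you cannot tell two tools apart from that output, the model cannot either.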
Anthropic shipped tool-use telemetry that surfaces the per-call selection scores. For the first time you can see in numbers what the model thought your tool registry meant. The teams running it in production report the same thing I see in consulting engagements: the descriptions are the bug. I go deeper on structuring the whole registry in tool registry design for agentic AI; this post is the field-repair version.
Why debugging the model is the wrong reflex
Reaching for the model first feels right because the model is the new, mysterious part of the system. But the model is also the one part you cannot fix directly. You can swap it, prompt it, or turn the temperature down, and none of that addresses a registry where two tools answer to the same description. You are tuning the one component that is behaving correctly.
The tell is consistency. A model that picks the wrong tool at random points to a prompt or context problem. A model that picks the same wrong tool every time for the same class of request points to a registry problem. It is reading your descriptions and arriving at the same conclusion you wrote down for it. The fix lives in the string, not the sampler.
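The consistency check is easy to automate against your own call logs. The sketch below assumes you already log a request class, the expected tool, and the chosen tool per call; the log shape, the 0.8 threshold, and the minimum of three errors are illustrative assumptions, not measured values.

```python
from collections import Counter, defaultdict


def diagnose(calls, threshold=0.8, min_errors=3):
    """Classify wrong-tool selections per request class.

    calls: iterable of (request_class, expected_tool, chosen_tool) tuples.
    Returns {request_class: (likely_cause, most_common_wrong_tool)}.
    """
    wrong = defaultdict(Counter)
    for request_class, expected, chosen in calls:
        if chosen != expected:
            wrong[request_class][chosen] += 1

    report = {}
    for request_class, counts in wrong.items():
        total = sum(counts.values())
        if total < min_errors:
            continue  # too few errors to call it a pattern
        tool, hits = counts.most_common(1)[0]
        # One wrong tool dominating the errors suggests the descriptions overlap.
        cause = "registry" if hits / total >= threshold else "prompt_or_context"
        report[request_class] = (cause, tool)
    return report
```

A "registry" verdict names the lookalike tool to fix first; a "prompt_or_context" verdict means the errors are scattered and the descriptions are probably not the cause.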
Three rules I apply before any agent debug
1. Name tools precisely
"search" is not a tool name. "search_customer_orders_by_email" is. Vague names lose to specific ones every time, regardless of model strength. The name is the first thing the model reads and the cheapest signal you can sharpen. A precise name does half the selection work before the description is even consulted.
2. Describe when to use each tool, not what it does
The description should answer "given this request, should I pick this?", not "what API does this wrap?". The model already infers function from a good name. What it needs from you is selection criteria: the situations this tool is for, and the lookalike situations it is not for. I expand this into six concrete edits in tool descriptions are prompts, stop treating them like docstrings.
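Rules 1 and 2 can be enforced with a small lint pass over the registry. This is a rough heuristic sketch: the list of vague names and the phrases it looks for ("use this when", "do not use", "not for") are my assumptions about a house style, so adjust them to your own conventions.

```python
import re

# Names too generic to carry selection signal on their own (illustrative list).
VAGUE_NAMES = {"search", "get", "fetch", "query", "run", "get_record"}

EXCLUSION_PATTERN = re.compile(r"\bdo not use\b|\bnot for\b", re.IGNORECASE)


def lint_tool(name, description):
    """Return a list of problems found in one tool's name and description."""
    problems = []
    if name in VAGUE_NAMES or "_" not in name:
        problems.append("name is too generic")
    if not description.lower().startswith("use this when"):
        problems.append("description does not lead with a selection trigger")
    if not EXCLUSION_PATTERN.search(description):
        problems.append("description has no exclusion clause")
    return problems
```

Running it in CI keeps docstring-style descriptions from creeping back in when someone registers a new tool.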
3. Load only the tools the task needs
Every unused tool in scope is noise the model scores against, and it costs real money. GitHub measured roughly 8-12 KB of schema overhead per unused tool on every call, a point I unpack in the cheapest LLM call is the one you do not make. The cognitive cost shows up as wrong-tool calls before the bill does. Scope the default load to the task; keep the long tail in specialised scopes the agent loads on demand.
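One way to implement a curated load list is to tag each tool with the scopes it belongs to and build the per-task tool list from those tags. The registry contents and scope names below are hypothetical; only get_customer_record and get_order_record appear elsewhere in this post.

```python
DEFAULT_SCOPE = "default"

# Hypothetical registry: each tool declares the scopes it should load under.
REGISTRY = {
    "get_customer_record": {"scopes": {"default", "support"}},
    "get_order_record": {"scopes": {"default", "support"}},
    "reconcile_ledger": {"scopes": {"finance"}},
    "export_audit_log": {"scopes": {"compliance"}},
}


def tools_for(scopes):
    """Return the tool names to load: the default set plus any requested scopes."""
    wanted = set(scopes) | {DEFAULT_SCOPE}
    return sorted(name for name, meta in REGISTRY.items() if meta["scopes"] & wanted)


print(tools_for([]))           # default load only
print(tools_for(["finance"]))  # default load plus the finance long tail
```

The long-tail tools still exist; they just stop competing for selection on calls that never needed them.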
A concrete before and after
Recent engagement: an ERP integration agent had a "get_record" tool described as "Fetches a record from the database." The agent kept passing customer IDs to a function that took order IDs. The fix was not a smarter prompt. It was the description:
name: get_customer_record
description: Use this when the user asks about a CUSTOMER (account, profile, contact). Takes a customer_id (UUID). Do not use for orders or invoices; see get_order_record for those.
Selection accuracy on the affected workflow jumped from 71% to 96% in the next eval run. No model change. No prompt rewrite. The description was already a prompt; it just had not been written like one. The exclusion clause ("do not use for orders or invoices") did most of the work, because the failure was a lookalike, not an unknown.
Reading the tool-selection telemetry
The new telemetry gives you a score per candidate tool per call. Treat it like a confusion matrix. When the correct tool wins by a hair over a near-twin, that gap is your warning that the next context change will flip the decision. Widen the gap deliberately by sharpening the winner's "use this when" and adding an exclusion to the runner-up. You are not chasing a single failing call; you are engineering the margin between lookalikes.
Log these scores the same way you would log any other production signal. A tool whose win margin is shrinking release over release is a regression in slow motion, and you want to catch it on a dashboard rather than in a support ticket.
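As a sketch of the margin idea, assume each call's telemetry can be flattened into a dict of tool name to score. That shape is my assumption for illustration, not the telemetry's documented format, and the 0.05 drop threshold is arbitrary.

```python
def win_margin(scores, expected):
    """Margin between the expected tool and its strongest rival on one call.

    scores: dict of tool name -> selection score (hypothetical format).
    A negative margin means a different tool outscored the expected one.
    """
    runner_up = max(
        (score for tool, score in scores.items() if tool != expected),
        default=0.0,
    )
    return scores.get(expected, 0.0) - runner_up


def shrinking_margins(history, min_drop=0.05):
    """Flag tools whose mean margin fell by at least min_drop.

    history: dict of tool name -> list of mean margins per release, oldest first.
    """
    return sorted(
        tool
        for tool, margins in history.items()
        if len(margins) >= 2 and margins[0] - margins[-1] >= min_drop
    )
```

Feeding shrinking_margins from a per-release aggregate gives you the dashboard signal described above: a tool that still wins, but by less each time.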
Where to start in your codebase
- 01. Inventory the registry. List every tool you have registered. Anything not called in the last 30 days, drop from the default load and keep it in a specialised scope.
- 02. Rewrite descriptions as selection criteria. For each remaining tool, lead with "Use this when…" and add the exclusions. "Use this when…" beats "This function returns…" every time.
- 03. Add a tool-selection regression test. Write evals that ask the agent to pick the right tool for ambiguous inputs. Track the score deltas over time. That is your tool-registry regression test.
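The regression test in the last step can start very small. In this sketch, select_tool stands for whatever function asks your agent which tool it would call; the stub here is a keyword matcher used only so the example runs, and the cases and tolerance are made up for illustration.

```python
def selection_accuracy(cases, select_tool):
    """Fraction of eval cases where the selected tool matches the expected one.

    cases: list of (user_request, expected_tool) pairs.
    select_tool: callable taking a request string and returning a tool name.
    """
    if not cases:
        raise ValueError("need at least one eval case")
    hits = sum(1 for request, expected in cases if select_tool(request) == expected)
    return hits / len(cases)


def regressed(current, baseline, tolerance=0.02):
    """True when accuracy dropped below the baseline by more than the tolerance."""
    return current < baseline - tolerance


# Deliberately ambiguous inputs: the lookalike cases are the ones worth testing.
AMBIGUOUS_CASES = [
    ("Pull up the record for jane@example.com", "get_customer_record"),
    ("What is the status of order 1042?", "get_order_record"),
    ("Show me the account behind order 1042", "get_customer_record"),
]


def stub_select_tool(request):
    # Stand-in for a real agent call; replace with your own selection hook.
    return "get_order_record" if "order" in request.lower() else "get_customer_record"


accuracy = selection_accuracy(AMBIGUOUS_CASES, stub_select_tool)
print(f"selection accuracy: {accuracy:.2f}")
```

Store the accuracy from each run next to the registry version, and fail the build when regressed returns True against the last accepted baseline.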
Common mistakes
- Copying the function docstring into the description. The docstring is for a developer who already decided to call it; the description is for a model still deciding.
- Listing every tool "just in case." Unused tools are pure downside: cost, latency, and selection noise.
- Fixing one failing call by hand instead of widening the margin between the two tools that keep getting confused.
- Treating consistent wrong-tool selection as a model failure and reaching for a bigger model.
Most "the agent does not work" reports I get from teams turn out to be tool-registry reports in disguise. The agent is doing its job: scoring options against the descriptions you provided. Change the descriptions and the behaviour changes. That is not a workaround. That is the design.