How do I know if the model or the tool registry is the problem?

Look at consistency. Random wrong-tool calls point at the prompt, context, or temperature. The same wrong tool chosen every time for the same kind of request points at the registry, because the model is reading your descriptions and reaching the conclusion you wrote down for it.

What should a tool description actually contain?

A selection criterion, not an implementation summary. Lead with "Use this when…", give one example user request that should map to the tool, and spell out the lookalike situations where it should not be used. The name and parameter schema carry the rest.

Does removing unused tools really matter?

Yes, on two axes. Each registered tool adds schema overhead (GitHub measured roughly 8-12 KB per unused tool per call), and every extra candidate is one more thing the model scores against, which lowers selection accuracy before it raises the bill.

What is tool-selection telemetry and how do I use it?

It is per-call scoring that shows how strongly the model preferred each candidate tool. Use it to find pairs of tools that win by a thin margin and deliberately widen that gap by sharpening the winner and adding exclusions to the runner-up.

How is this different from prompt engineering?

Tool descriptions are a specific kind of prompt: the one the model reads while deciding which capability to invoke. You can leave the system prompt untouched and still change behaviour dramatically by rewriting the registry, because selection happens against the tool strings, not the system message.

Tool Descriptions Are Prompts: Fix the Registry

I watched an agent in a customer system pick the wrong tool three times in a row last week. The reflex on the team was to debug the model: log the prompts, retry with bigger context, ablate the system message. The model was fine. The tool registry was the problem. This is the most common misdiagnosis I see in production agent work, and it is why I wrote the companion piece "Your agents aren't broken, your tools are: three questions to ask before you build one".

The registry is the spec the model reads at runtime

The mental model most engineers carry over from REST API design is wrong here. When you call a REST endpoint, you know which one to invoke because you read your own code. When a model picks a tool, it reads the description you wrote for it. Whatever you put in that string is the spec the model selects against. There is no other source of truth at runtime.

Say that back slowly, because it is the whole post. The function body does not influence selection. The variable names in your implementation do not influence selection. The Jira ticket that explains why the tool exists does not influence selection. Only the name, the description, and the parameter schema reach the model. If those three strings are vague, the model is guessing, and a guessing model looks exactly like a broken one from the outside.
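To make that concrete, here is a minimal sketch of what crosses the wire. The field names (name, description, input_schema) follow the shape Anthropic's Messages API uses for tool definitions. The tool, its function body, and the helper model_visible_payload are hypothetical and exist only for illustration.

```python
import json


def search_customer_orders_by_email(email: str) -> list[dict]:
    """Developer-facing docstring: queries the orders table via the ORM."""
    # Implementation details live here. None of this text is sent to the model.
    return []


# Hypothetical registry entry: the only text the model selects against.
TOOL_SPEC = {
    "name": "search_customer_orders_by_email",
    "description": (
        "Use this when the user wants a customer's orders and gives an email "
        "address. Do not use for invoices or account profile questions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "email": {"type": "string", "description": "Customer email address"}
        },
        "required": ["email"],
    },
}


def model_visible_payload(spec: dict) -> str:
    """Serialize only the three fields that reach the model."""
    visible = {key: spec[key] for key in ("name", "description", "input_schema")}
    return json.dumps(visible, indent=2)


payload = model_visible_payload(TOOL_SPEC)
print(payload)
```

Nothing from the docstring or the function body appears in the printed payload, which is the point: if the selection criterion is not in those three fields, the model never sees it.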

Anthropic shipped tool-use telemetry that surfaces the per-call selection scores. For the first time you can see in numbers what the model thought your tool registry meant. The teams running it in production report the same thing I see in consulting engagements: the descriptions are the bug. I go deeper on structuring the whole registry in "Tool registry design for agentic AI"; this post is the field-repair version.

Why debugging the model is the wrong reflex

Reaching for the model first feels right because the model is the new, mysterious part of the system. But the model is also the part whose internals you cannot edit. You can swap it, prompt it, or turn the temperature down, and none of that addresses a registry where two tools answer to the same description. You are tuning the one component that is behaving correctly.

The tell is consistency. A model that picks the wrong tool randomly is a prompt or context problem. A model that picks the same wrong tool every time for the same class of request is a registry problem. It is reading your descriptions and arriving at the same conclusion you wrote down for it. The fix lives in the string, not the sampler.
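The consistency test can be run against a call log. This is a rough sketch, assuming you already record the request class, the expected tool, and the tool the agent chose. The 0.8 threshold, the minimum of three errors, and the tool names in the sample log are my assumptions, not measured values.

```python
from collections import Counter, defaultdict


def diagnose(calls, threshold=0.8, min_errors=3):
    """calls: iterable of (request_class, expected_tool, chosen_tool).

    For each request class with errors, report whether the wrong picks
    concentrate on one tool (registry suspect) or scatter across several
    (prompt or context suspect).
    """
    wrong = defaultdict(Counter)
    for request_class, expected, chosen in calls:
        if chosen != expected:
            wrong[request_class][chosen] += 1

    report = {}
    for request_class, counts in wrong.items():
        total = sum(counts.values())
        top_tool, top_count = counts.most_common(1)[0]
        share = top_count / total
        if total < min_errors:
            verdict = "insufficient_data"
        elif share >= threshold:
            verdict = "registry"
        else:
            verdict = "prompt_or_context"
        report[request_class] = (verdict, top_tool, round(share, 2))
    return report


calls = [
    ("customer_lookup", "get_customer_record", "get_order_record"),
    ("customer_lookup", "get_customer_record", "get_order_record"),
    ("customer_lookup", "get_customer_record", "get_order_record"),
    ("customer_lookup", "get_customer_record", "get_customer_record"),
    ("refund", "issue_refund", "get_order_record"),
    ("refund", "issue_refund", "send_email"),
    ("refund", "issue_refund", "search_docs"),
]
report = diagnose(calls)
print(report)
```

In the sample log, every customer_lookup error lands on the same tool, so that class is flagged as a registry suspect, while the refund errors scatter across three tools.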

Three rules I apply before any agent debug

1. Name tools precisely

"search" is not a tool name. "search_customer_orders_by_email" is. Vague names lose to specific ones every time, regardless of model strength. The name is the first thing the model reads and the cheapest signal you can sharpen. A precise name does half the selection work before the description is even consulted.

2. Describe when to use each tool, not what it does

The description should answer "given this request, should I pick this?", not "what API does this wrap?". The model already infers function from a good name. What it needs from you is selection criteria: the situations this tool is for, and the lookalike situations it is not for. I expand this into six concrete edits in "Tool descriptions are prompts. Stop treating them like docstrings".
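One way to keep descriptions in that shape is to build them from structured parts instead of free text. A small sketch follows; the SelectionCriteria class and the example request are hypothetical, and the wording of the customer tool comes from the before-and-after example later in this post.

```python
from dataclasses import dataclass, field


@dataclass
class SelectionCriteria:
    use_when: str                 # the situations this tool is for
    example_request: str          # one user request that should map here
    not_for: list[str] = field(default_factory=list)  # lookalike exclusions

    def render(self) -> str:
        """Produce the description string that goes into the registry."""
        parts = [
            f"Use this when {self.use_when}.",
            f'Example request: "{self.example_request}"',
        ]
        for exclusion in self.not_for:
            parts.append(f"Do not use for {exclusion}.")
        return " ".join(parts)


criteria = SelectionCriteria(
    use_when="the user asks about a CUSTOMER (account, profile, contact)",
    example_request="What email do we have on file for Acme?",
    not_for=["orders or invoices; see get_order_record for those"],
)
description = criteria.render()
print(description)
```

The structure makes a missing exclusion visible in review: an empty not_for list on a tool that has a near-twin is easy to spot.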

3. Load only the tools the task needs

Every unused tool in scope is noise the model scores against, and it costs real money. GitHub measured roughly 8-12 KB of schema overhead per unused tool on every call, a point I unpack in "The cheapest LLM call is the one you do not make". The selection cost shows up as wrong-tool calls before the bill does. Scope the default load to the task; keep the long tail in specialised scopes the agent loads on demand.
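Scoped loading can be as simple as tagging each registry entry and filtering before the call. The sketch below assumes a flat in-memory registry. The scope names, the tools, and the helpers tools_for and schema_bytes are all hypothetical, and the byte count is only a rough proxy for what a provider actually bills.

```python
import json


def spec(name: str, description: str) -> dict:
    """Build a minimal tool definition for the example."""
    return {
        "name": name,
        "description": description,
        "input_schema": {"type": "object", "properties": {}},
    }


# Each entry pairs the scopes a tool belongs to with its definition.
REGISTRY = [
    ({"support", "billing"},
     spec("get_customer_record", "Use this when the user asks about a customer.")),
    ({"billing"},
     spec("get_invoice_pdf", "Use this when the user asks for an invoice document.")),
    ({"ops"},
     spec("restart_sync_job", "Use this when an operator asks to restart a sync job.")),
]


def tools_for(scope: str) -> list[dict]:
    """Return only the tool definitions tagged with the given scope."""
    return [tool for scopes, tool in REGISTRY if scope in scopes]


def schema_bytes(tools: list[dict]) -> int:
    """Approximate per-call schema overhead as serialized JSON size."""
    return len(json.dumps(tools).encode("utf-8"))


support_tools = tools_for("support")
all_tools = [tool for _, tool in REGISTRY]
print(len(support_tools), schema_bytes(support_tools), schema_bytes(all_tools))
```

Comparing the two byte counts per task gives you a number to put next to the wrong-tool rate when you argue for trimming the default load.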

A concrete before and after

Recent engagement: an ERP integration agent had a "get_record" tool described as "Fetches a record from the database." The agent kept passing customer IDs to a function that took order IDs. The fix was not a smarter prompt. It was the description:

name: get_customer_record
description: Use this when the user asks about a CUSTOMER (account, profile, contact). Takes a customer_id (UUID). Do not use for orders or invoices; see get_order_record for those.

Selection accuracy on the affected workflow jumped from 71% to 96% in the next eval run. No model change. No prompt rewrite. The description was already a prompt; it just had not been written like one. The exclusion clause ("do not use for orders or invoices") did most of the work, because the failure was a lookalike, not an unknown.
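For reference, here is roughly how the pair looks as registry entries, plus a small check that every "see X" pointer in a description names a tool that actually exists. The get_customer_record text is the one above. The get_order_record description, both parameter schemas, and the dangling_references helper are my assumptions for the sketch.

```python
import re

BEFORE = {"name": "get_record", "description": "Fetches a record from the database."}

AFTER = [
    {
        "name": "get_customer_record",
        "description": (
            "Use this when the user asks about a CUSTOMER (account, profile, "
            "contact). Takes a customer_id (UUID). Do not use for orders or "
            "invoices; see get_order_record for those."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string", "format": "uuid"}},
            "required": ["customer_id"],
        },
    },
    {
        # Hypothetical sibling: the tool is named above, but its text is assumed.
        "name": "get_order_record",
        "description": (
            "Use this when the user asks about an ORDER (status, items, "
            "shipping). Takes an order_id. Do not use for customer profiles; "
            "see get_customer_record for those."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
]


def dangling_references(tools: list[dict]) -> list[tuple[str, str]]:
    """Find 'see <tool_name>' pointers that name a tool missing from the list."""
    names = {tool["name"] for tool in tools}
    missing = []
    for tool in tools:
        for ref in re.findall(r"see ([a-z_]+)", tool["description"]):
            if ref not in names:
                missing.append((tool["name"], ref))
    return missing


print(dangling_references(AFTER))      # both pointers resolve
print(dangling_references(AFTER[:1]))  # order tool removed, pointer dangles
```

An exclusion that points at a tool you later rename or unload sends the model somewhere that does not exist, so a check like this is worth running whenever the registry changes.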

Common wrong-tool symptoms and the registry cause behind them

Symptom	Usual registry cause	The fix
Same wrong tool every time	Two tools share an overlapping description	Add exclusions; differentiate "use this when"
Wrong parameter passed	Param schema names carry no meaning	Rename params; describe units and ID type
Right idea, wrong mode	One tool does two jobs	Split it; see one tool, one purpose
Drifts as the session grows	Too many tools loaded by default	Trim the default load; scope the rest
Random misfires	Not a registry problem	Look at prompt, context, and temperature

Reading the tool-selection telemetry

The new telemetry gives you a score per candidate tool per call. Treat it like a confusion matrix. When the correct tool wins by a hair over a near-twin, that gap is your warning that the next context change will flip the decision. Widen the gap deliberately by sharpening the winner's "use this when" and adding an exclusion to the runner-up. You are not chasing a single failing call; you are engineering the margin between lookalikes.
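Computing the margin is straightforward once you have the scores in hand. This sketch assumes each call yields a mapping of tool name to a score where higher means more preferred, with at least two candidates. That input shape, the 0.10 threshold, and the sample numbers are assumptions for illustration, not a documented telemetry format.

```python
def win_margin(scores: dict[str, float]) -> tuple[str, str, float]:
    """Return (winner, runner_up, gap between their scores)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (winner, top), (runner_up, second) = ranked[0], ranked[1]
    return winner, runner_up, top - second


def thin_margins(calls, min_margin=0.10):
    """calls: iterable of (call_id, scores). Flag calls won by a thin gap."""
    flagged = []
    for call_id, scores in calls:
        winner, runner_up, margin = win_margin(scores)
        if margin < min_margin:
            flagged.append((call_id, winner, runner_up, round(margin, 3)))
    return flagged


calls = [
    ("c1", {"get_customer_record": 0.52, "get_order_record": 0.46, "search_docs": 0.02}),
    ("c2", {"get_customer_record": 0.90, "get_order_record": 0.08, "search_docs": 0.02}),
]
flagged = thin_margins(calls)
print(flagged)
```

Only the first call is flagged: the right tool won, but by a gap small enough that the pair deserves a sharper "use this when" and an exclusion.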

Log these scores the same way you would log any other production signal. A tool whose win margin is shrinking release over release is a regression in slow motion, and you want to catch it on a dashboard rather than in a support ticket.
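A dashboard alert for that slow regression could look something like the following. I am assuming you already aggregate a mean win margin per tool pair per release, stored in chronological order. The tool pairs, the numbers, and the 0.05 minimum drop are made up for the example.

```python
def shrinking_pairs(history, min_drop=0.05):
    """history: {(winner, runner_up): [mean margin per release, oldest first]}.

    Flag pairs whose margin fell in every release and lost at least
    min_drop overall.
    """
    flagged = {}
    for pair, margins in history.items():
        if len(margins) < 2:
            continue  # a trend needs at least two releases
        steadily_down = all(
            later < earlier for earlier, later in zip(margins, margins[1:])
        )
        drop = margins[0] - margins[-1]
        if steadily_down and drop >= min_drop:
            flagged[pair] = round(drop, 3)
    return flagged


history = {
    ("get_customer_record", "get_order_record"): [0.40, 0.31, 0.22],
    ("search_docs", "search_tickets"): [0.35, 0.37, 0.36],
}
alerts = shrinking_pairs(history)
print(alerts)
```

The first pair is flagged because its margin fell in every release; the second wobbles without trending down, so it stays quiet.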

Where to start in your codebase

1. Inventory the registry

List every tool you have registered. Anything not called in the last 30 days, drop from the default load and keep it in a specialised scope.

2. Rewrite descriptions as selection criteria

For each remaining tool, lead with "Use this when…" and add the exclusions. "Use this when…" beats "This function returns…" every time.

3. Add a tool-selection regression test

Write evals that ask the agent to pick the right tool for ambiguous inputs. Track the score deltas over time. That is your tool-registry regression test.
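The first and third steps above can be sketched in a few lines. This assumes you have a last-called timestamp per tool and some function that returns the tool the agent picks for a request. Here fake_pick_tool is a keyword stand-in so the example runs offline, and the tool names, dates, and eval cases are hypothetical.

```python
from datetime import datetime, timedelta


def stale_tools(last_called: dict[str, datetime], now: datetime, days: int = 30) -> list[str]:
    """Step 1: tools whose most recent call is older than the cutoff."""
    cutoff = now - timedelta(days=days)
    return sorted(name for name, ts in last_called.items() if ts < cutoff)


def selection_accuracy(cases, pick_tool) -> float:
    """Step 3: share of eval cases where the picked tool matches the expected one."""
    hits = sum(1 for request, expected in cases if pick_tool(request) == expected)
    return hits / len(cases)


def fake_pick_tool(request: str) -> str:
    """Stand-in for the real agent call; replace with your own selection hook."""
    return "get_order_record" if "order" in request.lower() else "get_customer_record"


last_called = {
    "get_customer_record": datetime(2025, 1, 30),
    "export_legacy_csv": datetime(2024, 11, 2),
}
stale = stale_tools(last_called, now=datetime(2025, 1, 31))

cases = [
    ("What's the contact email for Acme?", "get_customer_record"),
    ("Where is order 1182?", "get_order_record"),
    ("Show the account that placed the last order", "get_customer_record"),
]
accuracy = selection_accuracy(cases, fake_pick_tool)
print(stale, round(accuracy, 2))
```

The third case is the ambiguous one: it mentions an order but asks about an account, and the keyword stand-in gets it wrong. Cases like that are the ones worth keeping in the suite, because they are where a description change shows up first. Tools that have never been called will not appear in last_called at all, so compare the stale list against the full registry as well.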

Common mistakes

Copying the function docstring into the description. The docstring is for a developer who already decided to call it; the description is for a model still deciding.
Listing every tool "just in case." Unused tools are pure downside: cost, latency, and selection noise.
Fixing one failing call by hand instead of widening the margin between the two tools that keep getting confused.
Treating consistent wrong-tool selection as a model failure and reaching for a bigger model.

Most "the agent does not work" reports I get from teams turn out to be tool-registry reports in disguise. The agent is doing its job: scoring options against the descriptions you provided. Change the descriptions and the behaviour changes. That is not a workaround. That is the design.

Tool descriptions are prompts. Fix the registry, not the agent.

Read next

Your agents aren't broken, your tools are: three questions to ask before you build one

Tool registry design for agentic AI: how the wrong registry kills accuracy before the prompt is read

Tool descriptions are prompts. Stop treating them like docstrings