Architecture Published 7 min

Why your agent keeps failing after 3 steps

The exit condition problem nobody talks about. Most agents are built for the happy path, where every tool call succeeds and the task completes cleanly. Production does not stay on the happy path.

Jigar Joshi, Agentic AI Architect and Consultant

Walk into any agent codebase and look at the loop. Most look like this: while not done, think, act, observe. The "done" condition is hand-wavy, usually "the model said it is done." That works for tutorials. It does not work for production, where the model says it is done when it is not, or never says it is done at all.

The three exits an agent actually has

  • Success: the model achieved the goal and you can verify the achievement independently, not just take its word.
  • Bounded failure: the agent ran out of budget (steps, time, or tokens) without success, and the calling code handles it.
  • Unbounded failure: anything else, including silent loops, partial completions claimed as full ones, or "I cannot continue" with no real reason.

Most agents are coded as if only the first exit exists. The third one is where the "fails after 3 steps" reports come from: the agent hits a situation its happy-path loop never anticipated and has no defined way out, so it loops, stalls, or lies about completion.

The three exits and how to handle each
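| Exit              | What it looks like                  | Required handling                          |
|-------------------|-------------------------------------|--------------------------------------------|
| Success           | Goal met and verifiable             | Independent success check, not self-report |
| Bounded failure   | Budget exhausted, no success        | Typed failure the caller handles           |
| Unbounded failure | Loops, false completion, vague stop | Detect and convert to bounded failure      |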
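
One way to do the "detect and convert" step for the silent-loop case, as a rough Python sketch. It assumes each action can be summarized as a tool name plus an argument string, and that repeating the same action three times counts as stuck. `RepeatDetector`, `BoundedFailure`, and the threshold are hypothetical names and values chosen for illustration, not part of any framework.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoundedFailure:
    """Typed failure the caller can branch on (hypothetical name)."""
    reason: str
    steps_taken: int


class RepeatDetector:
    """Flags an agent that keeps issuing the same tool call with the same arguments."""

    def __init__(self, max_repeats: int = 3) -> None:
        # Assumed threshold; tune it per task.
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()
        self._steps = 0

    def record(self, tool: str, args: str) -> Optional[BoundedFailure]:
        """Record one action; return a BoundedFailure once it has repeated too often."""
        self._steps += 1
        key = (tool, args)
        self._seen[key] += 1
        if self._seen[key] >= self.max_repeats:
            return BoundedFailure(
                reason=f"repeated {tool}({args}) {self._seen[key]} times",
                steps_taken=self._steps,
            )
        return None
```

Call `record` once per action inside the loop and return its result to the caller as soon as it is not `None`. This only catches exact repeats; false completions and vague stops need the independent success check instead.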

Designing for all three

Every agent loop needs three things: an explicit step budget, an explicit success check that does not just trust the model's self-report, and a typed failure mode the calling code is required to handle. The calling code, not the agent, decides what to do on failure. This is the same conclusion I reached the hard way with unbounded self-correction loops in three patterns I broke in 2025: the model should not be the thing that decides when to stop.
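
Those three requirements fit in a few lines. The following is a minimal sketch, not a reference implementation: `run_agent`, `Success`, `BudgetExhausted`, and `handle` are names I made up for illustration, `step` stands in for one think-act-observe cycle, and only the step budget is shown (time and token budgets would follow the same pattern).

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Success:
    result: str
    steps: int


@dataclass(frozen=True)
class BudgetExhausted:
    steps: int
    last_output: str


# The caller must deal with both variants; there is no third, implicit exit.
Outcome = Union[Success, BudgetExhausted]


def run_agent(
    step: Callable[[str], str],
    verify: Callable[[str], bool],
    task: str,
    max_steps: int = 10,
) -> Outcome:
    """Run `step` until `verify` passes or the step budget is spent."""
    state = task
    for n in range(1, max_steps + 1):
        state = step(state)
        # Independent check: the loop never asks the model whether it is done.
        if verify(state):
            return Success(result=state, steps=n)
    return BudgetExhausted(steps=max_steps, last_output=state)


def handle(outcome: Outcome) -> str:
    """Caller-side policy (assumed): the caller decides what failure means."""
    if isinstance(outcome, Success):
        return f"ok after {outcome.steps} steps"
    return f"escalate to a human after {outcome.steps} steps"
```

With a toy `step` that never changes its state and a `verify` that always fails, `run_agent(lambda s: s, lambda s: False, "t", max_steps=4)` returns `BudgetExhausted(steps=4, last_output="t")` instead of spinning, and `handle` turns that into an escalation.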

You also want to see these failures as a population, not one ticket at a time. The step-count histogram from the agent observability stack we ship is where unbounded failures show up as a growing long tail, and unhappy-path exits deserve dedicated cases in your eval suite.
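
If you do not have that tooling yet, a crude version of the population view is easy to approximate. This sketch assumes you already log one step count per agent run; `step_histogram`, `tail_fraction`, the bucket width, and the threshold are all hypothetical choices for illustration and say nothing about how any particular observability stack works.

```python
from collections import Counter
from typing import Dict, Iterable


def step_histogram(step_counts: Iterable[int], bucket: int = 5) -> Dict[int, int]:
    """Bucket per-run step counts; each key is the lower bound of its bucket."""
    hist = Counter((n // bucket) * bucket for n in step_counts)
    return dict(sorted(hist.items()))


def tail_fraction(step_counts: Iterable[int], threshold: int) -> float:
    """Share of runs that took at least `threshold` steps."""
    counts = list(step_counts)
    if not counts:
        return 0.0
    return sum(1 for n in counts if n >= threshold) / len(counts)
```

Tracking `tail_fraction` week over week is one simple way to notice the long tail growing before it shows up as tickets.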

Common mistakes

  • Trusting "the model said it is done" as the success check, with nothing verifying the claim.
  • Having no step budget, so an agent that cannot converge runs until something else times out.
  • Treating failure as something the agent recovers from internally instead of a typed result the caller handles.
  • Never writing tests for the failure exits, so they are only exercised in production.
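
To make the last point concrete, here is a hedged sketch of what tests for the failure exits can look like, using Python's standard `unittest`. The `run` function is a deliberately tiny stand-in loop I wrote for the example, and the string outcomes are an assumption made for brevity; in a real codebase you would test your own loop and its typed results.

```python
import unittest
from typing import Callable, Tuple


def run(
    step: Callable[[int], Tuple[str, bool]],
    verify: Callable[[str], bool],
    max_steps: int,
) -> str:
    """Stand-in loop: returns "success", "false_completion", or "budget_exhausted"."""
    for n in range(max_steps):
        output, claims_done = step(n)
        if verify(output):
            return "success"
        if claims_done:
            # The model said it was done, but the independent check disagrees.
            return "false_completion"
    return "budget_exhausted"


class FailureExitTests(unittest.TestCase):
    def test_model_claims_done_but_check_fails(self):
        def lying_step(n):
            return ("half-written report", True)

        outcome = run(lying_step, lambda out: out == "full report", 5)
        self.assertEqual(outcome, "false_completion")

    def test_agent_never_converges(self):
        def spinning_step(n):
            return ("still thinking", False)

        outcome = run(spinning_step, lambda out: False, 5)
        self.assertEqual(outcome, "budget_exhausted")
```

Saved as a test module, this runs with `python -m unittest`. The stubbed `step` functions are the important part: they force the unhappy paths on demand, so those paths stop being exercised only in production.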

The agent that fails after 3 steps is usually the agent that was never told what "done" means, only what "do" means. Spec the exit conditions before you spec the tools. In my consulting work, fixing exit conditions is one of the most reliable improvements I make to a wobbling production agent.
