When NOT to Build an Agent
Not every problem needs an AI agent. This article gives you a decision framework for when agents are the wrong choice, with a comparison table, anti-patterns, and the Klarna case study that proves the cost of over-engineering.
Part 2 of 12: The Practitioner’s Guide to AI Agents
The Agent I Wish I Had Not Built
I recently watched a team tackle a problem that had a clear, deterministic solution. The inputs were structured. The outputs were predictable. The mapping between them was known. A rule-based engine would have handled it reliably, cheaply, and with full auditability.
Instead, the team built it with an LLM and a reasoning model. Every request became a token-heavy API call to get a response that a lookup table could have produced. The system worked, but it was expensive, slower than it needed to be, and harder to debug when the output was wrong. The failure modes were opaque in the way agent failures always are: you had to trace through the model’s reasoning to understand why it chose one path over another, when the “reasoning” was never necessary in the first place.
I do not say this to criticize the team. The instinct was understandable. When every vendor pitch and conference talk says “build an AI agent,” it takes discipline to step back and ask whether the problem actually requires one. The hype cycle creates pressure to reach for LLMs even when simpler, more deterministic approaches would produce better results.
That experience is the seed of this article. Article 1 defined what an agent IS: a system that uses an LLM to decide which actions to take in a loop until a goal is met. This article defines what an agent is NOT: the answer to every automation problem.
The Six Disqualifiers
Not every problem deserves an agent. Here are six conditions where an agent is the wrong tool. If your use case matches even one, pause and consider a simpler alternative.
1. The Logic Is Deterministic
If you can express the task as a decision tree, flowchart, or lookup table, you do not need an LLM to reason about it. A script will be faster, cheaper, and more reliable.
Examples: routing tickets based on category codes, applying business rules to structured data, transforming file formats, validating inputs against a schema.
The test is simple. Can you write an if-else chain that handles every case? If yes, do that. Agents add value when the input is ambiguous and the correct action depends on interpretation. If the input is structured and the mapping is known, an agent is overhead.
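The if-else test can be made concrete. Below is a minimal sketch of the kind of deterministic router that makes an agent unnecessary; the category codes and routing table are hypothetical stand-ins, not a real system.

```python
# Illustrative only: routing support tickets by category code.
# The category codes and queue names are hypothetical.
ROUTING_TABLE = {
    "BILLING": "finance-queue",
    "LOGIN": "auth-queue",
    "SHIPPING": "logistics-queue",
}

def route_ticket(category_code: str) -> str:
    """Deterministic routing: every input maps to a known output."""
    queue = ROUTING_TABLE.get(category_code)
    if queue is None:
        # The fallback is explicit and auditable, not "decided" by a model.
        return "manual-review-queue"
    return queue

print(route_ticket("BILLING"))  # finance-queue
print(route_ticket("UNKNOWN"))  # manual-review-queue
```

Every branch here is enumerable, testable, and traceable. The moment you can write this table, the agent version is pure overhead.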
2. Latency Cannot Tolerate a Reasoning Loop
An agent loop adds seconds per step. Each iteration involves an LLM call (typically 500ms to 3 seconds), a tool execution, and another LLM call to evaluate the result. A three-step agent workflow takes 5 to 15 seconds in the best case.
If your use case requires sub-second response times, an agent cannot deliver. Payment processing, real-time fraud scoring, API gateways, user-facing autocomplete: these need deterministic speed. An agent’s variable latency will break your SLA.
3. The Task Is a Simple Input-Output Mapping
If the input has a known structure and the output is a direct transformation, do not route it through an LLM. A database lookup, a unit conversion, a tax calculation, a string format change: these are functions, not reasoning problems.
The cost difference is not trivial. A database lookup costs fractions of a cent and returns in milliseconds. An LLM call to produce the same answer costs 1 to 10 cents and takes seconds. At scale, this multiplies. A million lookups per day at $0.01 each is $10,000 per day, roughly $300,000 per month, for work that a SQL query handles for the price of compute.
4. The Process Must Be Fully Auditable
Regulated industries (finance, healthcare, insurance) often require complete auditability: every decision must be traceable to specific rules, with a clear chain of logic that an auditor can follow.
Agent reasoning is probabilistic. The same input can produce different reasoning paths on different runs. The LLM’s “chain of thought” is a plausible narrative, not a deterministic proof. When a regulator asks “why did the system make this decision?”, pointing to a prompt and a temperature setting is not an adequate answer.
This does not mean agents have no place in regulated environments. It means the agent should not be the decision-maker for auditable processes. It can assist a human, surface relevant information, or draft a recommendation. But the auditable decision itself should follow deterministic, rule-based logic.
The distinction matters more than it appears. “A human reviews it” is not enough. The human reviewing agent output in a regulated process needs domain expertise to catch when the agent’s recommendation looks plausible but is wrong. I wrote about this in Judgment-in-the-Loop: human-in-the-loop is a checkpoint, but judgment-in-the-loop is a capability. In auditable processes, you need the latter: someone who knows the domain well enough to recognize that the agent’s confident summary missed a regulatory nuance or misinterpreted a threshold. Without that judgment, the human review becomes a rubber stamp, and the audit trail is worth less than it looks.
5. Cost Per Query Matters at Your Scale
Agents are expensive per invocation. A single agent task that involves three LLM calls and two tool executions costs 5 to 50 cents, depending on the model and context window size. Traditional API calls for the same task cost fractions of a cent.
This math gets brutal at volume. Consider a Data Quality monitoring system that checks 100,000 records daily. If each check requires an agent call at $0.05, that is $5,000 per day, or $150,000 per month. A rule-based validation script checking the same records costs a few dollars in compute.
But the cost comparison requires honesty about what each approach actually catches. A rule-based script excels at structural checks: null counts, row counts, schema conformance, freshness thresholds. These are the high-volume, low-ambiguity checks where spending $0.05 per record on an LLM is genuinely wasteful. Run them with Great Expectations or Soda at a fraction of a cent per check. That covers the 80%.
The remaining 20% is where a Data Quality agent earns its cost. Value distribution drift, cross-column consistency, semantic anomalies where the data is structurally valid but meaningfully wrong: these require reasoning over context, not threshold comparisons. The right architecture is a hybrid. Let the rule-based script handle 100,000 record-level validations at scale. Let the agent handle the 50 to 200 table-level and column-level semantic checks that a script cannot reason about. At 200 agent checks per day at $0.05 each, you spend $10/day instead of $5,000. The agent handles the failures that cost you weeks of investigation when they go undetected. The script handles everything else.
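The hybrid split described above can be sketched as a simple dispatcher. The check names and both tier implementations below are hypothetical stand-ins: in practice the rule tier might be Great Expectations or Soda, and the agent tier a single LLM-backed semantic check.

```python
# High-volume structural checks go to the rule tier; only the semantic
# checks a script cannot reason about reach the (expensive) agent tier.
STRUCTURAL_CHECKS = {"null_count", "row_count", "schema", "freshness"}

def run_rule_based(check_type: str, payload: dict) -> str:
    # Stand-in for a Great Expectations / Soda style validation
    # (fractions of a cent per check).
    return f"rule:{check_type}"

def run_agent_check(check_type: str, payload: dict) -> str:
    # Stand-in for an LLM-backed semantic check (~$0.05 per call).
    return f"agent:{check_type}"

def dispatch_check(check_type: str, payload: dict) -> str:
    """Route structural checks to rules; reserve the agent for the rest."""
    if check_type in STRUCTURAL_CHECKS:
        return run_rule_based(check_type, payload)
    return run_agent_check(check_type, payload)

print(dispatch_check("row_count", {}))           # rule:row_count
print(dispatch_check("distribution_drift", {}))  # agent:distribution_drift
```

The dispatcher itself is deterministic, which is the point: the decision about *when* to spend agent money should never be probabilistic.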
The IBM survey of 2,000 CEOs found that only one in four AI projects delivers on its promised ROI. Cost overruns are a leading reason. Before choosing an agent, do the multiplication: per-query cost times daily volume times 30 days. If the number makes your finance team uncomfortable, ask whether you are sending the agent the right work. Often the answer is not “do not use an agent” but “use an agent for fewer, harder checks and a script for everything else.”
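The multiplication is trivial to encode, which is exactly why there is no excuse for skipping it. A sketch using the article's own numbers:

```python
def monthly_cost(cost_per_query: float, daily_volume: int, days: int = 30) -> float:
    """Per-query cost x daily volume x days: the pre-commit math."""
    return cost_per_query * daily_volume * days

# Agent on every record vs. agent on only ~200 semantic checks per day:
print(round(monthly_cost(0.05, 100_000)))  # 150000
print(round(monthly_cost(0.05, 200)))      # 300
```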
6. “Almost Right” Is Unacceptable
This is the most underappreciated disqualifier. Agents produce outputs that are almost right most of the time. The compound error math from a related article illustrates the cost: if each step in an agent workflow has 85% accuracy, a ten-step workflow succeeds only about 20% of the time (0.85^10 ≈ 0.20). That is not a reliability problem you can tune away. It is a structural property of probabilistic systems chained together.
For some use cases, “almost right” is fine. A research summary that captures 90% of the key points is useful. A draft email that needs minor editing saves time.
For other use cases, “almost right” is worse than wrong. A financial calculation that is off by 2% can trigger incorrect trades. A medical dosage recommendation that is 95% correct is 5% dangerous. A legal contract clause generated by an agent that “looks right” but contains a subtle error can cost millions.
If the failure mode of “close but not quite” is worse than the failure mode of “clearly broken,” do not use an agent for that task.
Agent vs. Script: A Side-by-Side Comparison
The same task, solved two ways. This table shows when the script wins and when the agent earns its complexity.
| Dimension | Script / Traditional Automation | AI Agent |
|---|---|---|
| Input type | Structured, predictable | Unstructured, ambiguous, variable |
| Logic | Deterministic: if X then Y | Probabilistic: interpret X, reason about Y |
| Latency | Milliseconds | Seconds to minutes |
| Cost per call | Fractions of a cent | 1-50 cents |
| Auditability | Complete: every branch is traceable | Partial: reasoning path varies per run |
| Failure mode | Predictable: known edge cases | Unpredictable: novel errors on novel inputs |
| Error handling | Explicit: catch every exception | Implicit: the model “decides” how to handle errors |
| Scalability | Linear cost curve | Cost scales with token usage and model pricing |
| Maintenance | Update rules when requirements change | Update prompts, context, tools, and evals |
| Best for | Known workflows, high volume, strict SLAs | Novel problems, ambiguous input, multi-step reasoning |
The pattern is clear. Scripts win on cost, speed, predictability, and auditability. Agents win on flexibility, adaptability, and handling ambiguity. The mistake teams make is reaching for an agent when the problem sits squarely in the left column.
Anti-Patterns: Agents That Should Be Scripts
These are real patterns I have seen across client engagements and in industry case studies. Each one represents an agent built for a problem that did not need one.
The LLM-powered lookup table. A team builds an agent to answer “what is the owner of dataset X?” The agent calls a metadata API, parses the response, and returns the owner name. A single API call with a JSON path extraction does the same thing in 10 milliseconds for zero LLM cost.
The natural language SQL translator for one query. Instead of writing a parameterized SQL query, a team builds an agent that takes natural language input and generates SQL. The catch: the users always ask the same three questions with minor variations. Three parameterized queries with a dropdown menu would be faster, cheaper, and guaranteed correct.
The AI-powered email classifier. Incoming emails get classified into five categories. An agent reads each email, reasons about its content, and assigns a label. But 90% of the emails contain explicit keywords (“invoice,” “support request,” “meeting”) that a regex filter handles perfectly. The agent runs on the remaining 10% where the keywords are absent. Better: use the regex filter for the 90% and route only the ambiguous 10% to an LLM call (not even a full agent, just a single classification call).
The chatbot pretending to be an agent. A “customer service agent” that answers questions from a knowledge base but never calls tools, never loops, never adapts its approach. This is a RAG pipeline with extra branding. It works fine as a RAG pipeline. Calling it an agent sets expectations the system cannot meet.
The monitoring agent that checks the same thing every time. An agent runs every hour to check if a data pipeline completed. It calls a status API, reads the response, and sends a Slack message if the pipeline failed. A cron job with curl and a Slack webhook does this identically. The agent adds latency, cost, and a probabilistic failure mode to a task that is entirely deterministic.
The Decision Flowchart
Before starting any agent project, run through this decision tree.
The flowchart asks four questions in sequence:
- Does the task require reasoning over ambiguous input? If the input is structured and the mapping is known, use a script.
- Does it need to adapt its approach based on intermediate results? If the steps are fixed regardless of what happens in between, use a script.
- Does it require multiple tools? If it only needs one API call or one database query, a function call is simpler than an agent loop.
- Will inputs vary in unexpected ways? If you can enumerate every possible input pattern, hardcode the handling. Agents earn their cost when the input space is genuinely open-ended.
If you answer “No” to any of these questions, you probably do not need an agent. The flowchart is deliberately conservative. It is cheaper to start with a script and upgrade to an agent when the script’s limitations become measurable than to start with an agent and discover you over-engineered.
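The four questions reduce to a checklist. A sketch (the wording is paraphrased from the flowchart above) encoding the conservative rule that any "No" means start with a script:

```python
# Paraphrased from the decision flowchart; answers are in question order.
QUESTIONS = [
    "requires reasoning over ambiguous input",
    "adapts its approach based on intermediate results",
    "requires multiple tools",
    "inputs vary in unexpected ways",
]

def needs_agent(answers: list[bool]) -> bool:
    """Conservative rule: a single 'No' means start with a script."""
    return all(answers)

# The hourly pipeline status check: structured input, fixed steps,
# one tool, fully enumerable inputs.
print(needs_agent([False, False, False, False]))  # False
```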
The Klarna Lesson: What Happens When You Over-Agent
The most instructive case study of 2025 is not a startup. It is Klarna.
In early 2024, Klarna announced that its AI chatbot, powered by OpenAI, was doing the work of 700 human agents. The company stopped hiring human customer service representatives entirely. CEO Sebastian Siemiatkowski publicly celebrated the cost savings.
By mid-2025, Klarna reversed course. The company began rehiring human agents after customer satisfaction scores dropped and operational problems surfaced. Siemiatkowski admitted the mistake to Bloomberg: the company had “focused too much on efficiency and cost” at the expense of quality.
The volume metrics looked great. The AI handled 2.3 million conversations in its first month. Resolution rates were high. Time-to-first-response was fast. But these aggregate metrics masked a quality problem: the AI produced confident, generic responses that failed on edge cases requiring empathy, discretion, or escalation. Customers reported robotic responses, inflexible scripts, and the frustrating loop of repeating their issue to a human after the bot failed.
The deeper lesson is not “AI bad, humans good.” Klarna’s mistake was applying an AI-powered system to an entire problem space when only part of it needed AI. Basic inquiries (order status, return policies, payment schedules) were handled well by automation. Complex complaints, disputes, and edge cases needed human judgment. The right architecture was a hybrid: automate the deterministic 70%, route the ambiguous 30% to humans, and use AI to assist (not replace) the human on complex cases.
This is the “when NOT to build an agent” lesson at enterprise scale. The question was never “can AI handle customer service?” It was “which parts of customer service are deterministic enough for automation, and which parts require the kind of adaptive reasoning that justifies an agent or a human?”
Connecting the Gartner Prediction
Gartner’s prediction that over 40% of agentic AI projects will be cancelled by end of 2027 gets cited frequently. Most commentary frames it as a technology maturity problem: the models are not ready, the tooling is immature, governance is absent.
But Gartner’s own data tells a different story. The press release identifies “agent washing” as a primary driver: vendors rebranding existing chatbots, RPA tools, and AI assistants as “agentic” without adding genuine agent capabilities. Gartner estimates that only about 130 of the thousands of self-described agentic AI vendors have real agent technology.
The cancellation prediction is not about agents failing. It is about non-agents being sold as agents, and the projects built on them collapsing when the mismatch becomes obvious. HBR’s analysis of agentic AI failure reinforces this: projects fail due to bad use-case selection, not bad technology. The fix is not better agents. It is better judgment about when to use them.
This reframes the 40% cancellation rate from a warning about technology to a warning about decision-making. If your project is in that 40%, the most likely reason is not that the agent did not work. It is that the problem did not need an agent.
The Compound Error Tax
The compound error math (0.85^10 ≈ 0.20) appears throughout this series as a reliability concern. Here, I want to reframe it as a cost.
Every unnecessary agent step is a tax on reliability. If a five-step script handles your task at 99.9% reliability per step, the compound reliability is 99.5%. Replace that script with a five-step agent at 85% per step, and reliability drops to 44%. You have traded 99.5% for 44% and paid more per invocation to do it.
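The arithmetic is worth keeping at hand whenever someone proposes replacing a script with an agent. A one-function sketch reproducing the numbers above:

```python
def compound_reliability(per_step: float, steps: int) -> float:
    """End-to-end success rate of a chain of independent steps."""
    return per_step ** steps

print(round(compound_reliability(0.999, 5), 3))  # 0.995 (five-step script)
print(round(compound_reliability(0.85, 5), 3))   # 0.444 (five-step agent)
print(round(compound_reliability(0.85, 10), 3))  # 0.197 (ten-step agent)
```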
The tax compounds further when you account for debugging time. When a script fails, the error is in the code. You read the stack trace, find the line, fix it. When an agent fails, the error might be in the prompt, the context, the tool response, the model’s reasoning, or the interaction between all four. Debugging agent failures takes 5 to 10 times longer than debugging script failures in my experience.
This is not an argument against agents. It is an argument for precision. Use agents where the reasoning capability justifies the reliability tax. Use scripts everywhere else. The best agent architectures are the ones that minimize the number of steps that require probabilistic reasoning.
Anthropic’s own guidance on building effective agents makes the same point: start with the simplest possible solution and add agentic complexity only when the simpler version’s limitations are measurable.
A Practical Heuristic
When someone proposes building an agent, ask three questions:
1. “What would the non-agent version look like?” If nobody can describe the simpler alternative, the team has not understood the problem well enough to build any solution, let alone an agent.
2. “Where does this task require genuine reasoning?” If the answer is “nowhere, but the input is natural language,” you need a parser, not an agent. Natural language input does not automatically mean you need an LLM loop.
3. “What is the cost of being wrong?” If the answer is “we redo the task manually,” an agent is fine. If the answer is “we lose money, violate a regulation, or harm a customer,” the probabilistic nature of agents is a risk that needs explicit mitigation.
These three questions filter out the majority of misapplied agent projects. They are not anti-agent. They are pro-precision.
Do Next
| Priority | Action | Why it matters |
|---|---|---|
| No experience | Take one task you are considering for AI and describe the non-agent version first. Write out the if-else logic, the lookup tables, the API calls. See how far the simple version gets before it breaks down. | Most people skip this step and jump straight to “let’s build an agent.” The simple version is your baseline. If it handles 90% of cases, you only need AI for the remaining 10%, and that 10% might need a single LLM call rather than a full agent. |
| Learning | Run through the four-question decision flowchart above for three agent ideas you have seen or considered. For each one, document whether it passes or fails the test. | Building the habit of asking “does this need an agent?” before writing code saves weeks of wasted effort. Three examples give you enough practice to internalize the pattern. |
| Practitioner | Audit one existing agent in your organization against the six disqualifiers. Calculate the per-query cost at current volume. Compare it to a rule-based alternative for the deterministic portions. | Most organizations have at least one agent that should be a script. Finding it and replacing it saves money, reduces failure surfaces, and frees engineering time for problems that genuinely need agents. |
This is Part 2 of 12 in The Practitioner’s Guide to AI Agents. ← Previous: What Is an AI Agent? · Next: Pike’s Five Rules →
Sources & References
- Gartner: Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027 (2025)
- HBR: Why Agentic AI Projects Fail and How to Set Yours Up for Success (2025)
- Bloomberg: Klarna Turns From AI to Real Person Customer Service (2025)
- Fortune: Klarna Plans to Hire Humans Again, as New Landmark Survey Reveals Most AI Projects Fail to Deliver (2025)
- Entrepreneur: Klarna CEO Reverses Course By Hiring More Humans, Not AI (2025)
- Anthropic: Building Effective Agents (2024)
- OpenAI: A Practical Guide to Building Agents (2025)
- IBM Study: Businesses View AI Agents as Essential, Not Just Experimental (2025)
- Outreach: Agent Washing Exposed (2025)