Evals: How to Know If Your Agent Actually Works
Most agent teams ship without evals and rely on “looks right” testing. Pike’s first two rules apply directly: you cannot tell where an agent fails, and you cannot fix what you have not measured. Here is how to build an eval strategy that catches what demos miss.
Part 7 of 12: The Practitioner’s Guide to AI Agents
You Cannot Tell Where an Agent Is Going to Fail
I publish this blog with AI assistance. Claude helps with research, drafting, and editing. The output reads well: clear structure, confident claims, proper citations. Early on, I treated that fluency as a signal of accuracy. Then I started running systematic fact-checks before publishing. The results were humbling. Stale statistics presented as current. Source attributions that conflated different publications. Claims that sounded right but overstated what the cited study actually found. The structure was always clean. The facts were not always correct. And without the audit step, I would have published every one of them, because the outputs looked right.
That experience is why this article exists.
In Articles 1 through 5, you learned what agents are, when to build them, why simplicity matters, how to build one, and why the data entering the agent determines what comes out. This article is about how to measure whether all of that is actually working.
Rob Pike wrote his five rules of programming at Bell Labs in 1989. Rule 1: “You can’t tell where a program is going to spend its time.” Rule 2: “Measure. Don’t tune for speed until you’ve measured.”
These rules were about C programs and profiling. They apply with even more force to AI agents.
In Article 3 of this series, I mapped Pike’s five rules to agent development. Rules 1 and 2 translate directly: you cannot tell where an agent is going to fail, and you cannot fix what you have not measured. This article is about what “measure” means in practice, because most teams building agents today have not started.
The Culture of “Looks Right” Testing
Here is how most agent teams evaluate their work today: someone runs the agent, reads the output, and decides it looks reasonable. Maybe they try a few edge cases. Maybe they show a demo to a stakeholder. If nothing obviously breaks, the agent ships.
This is not evaluation. It is vibes.
The data confirms this is not anecdotal. LangChain’s 2025 State of Agent Engineering survey of 1,340 practitioners found that 89% have observability in place, but only 52% run any offline evals. Teams can see that their agent ran. They cannot tell whether the output was correct.
A study of production agents reinforced the gap: 74% of teams rely primarily on human-in-the-loop evaluation, and 75% evaluate without formal benchmarks. Agent behavior breaks traditional software testing because of nondeterminism. Teams report spending months creating even minimal evaluation datasets. So they fall back to spot-checking.
The problem is not laziness. Agent outputs are genuinely hard to evaluate. A SQL query is either correct or it is not. An agent’s research summary can be 90% accurate, miss one critical fact, and still read as perfectly coherent. A code-generation agent can produce syntactically valid code that passes tests but introduces a subtle security vulnerability. The failure modes are graduated, context-dependent, and often invisible to casual inspection.
Hamel Husain reviewed 900 agent repositories on GitHub and found that the vast majority had no systematic evaluation. His framing is precise: “Documentation tells the agent what to do. Telemetry tells you whether it worked. Evals tell you whether the output is good.” Most teams have the first. Some have the second. Almost none have the third.
The “Almost Right” Problem
The 2025 Stack Overflow Developer Survey found that 66% of developers cited “almost right” outputs as their top frustration with AI tools. Not wrong outputs. Almost right outputs.
This distinction matters. A clearly wrong answer gets caught. An almost-right answer gets shipped. And the real-world consequences of “almost right” are already documented:
Air Canada’s chatbot told a grieving passenger he could book a full-fare ticket and apply for a bereavement discount retroactively. The policy did not exist. The chatbot’s response was coherent, confident, and legally binding. A tribunal held Air Canada liable. The output was not gibberish; it was a plausible-sounding policy that happened to be fabricated.
Google AI Overviews told users to eat rocks for minerals and put glue on pizza, sourcing from joke Reddit posts and Onion articles. The outputs were not random noise; they were well-structured answers that happened to be sourced from unreliable data.
Medical chatbots present false clinical details with the same confidence as accurate ones. A Mount Sinai study confirmed that AI chatbots “run with medical misinformation” and present fabricated details as established fact.
Each of these failures would have passed a “looks right” test. The outputs were fluent, structured, and confident. Without systematic evaluation against ground truth, there was no mechanism to catch them.
The Perception Gap
If you cannot trust the output, can you at least trust your own judgment about the output? The evidence says no.
A randomized controlled trial by METR, published in June 2025, studied experienced open-source developers using AI coding agents on their own repositories. The developers believed the tools made them roughly 20% faster. The measured result: they were 19% slower.
That is not a small miscalibration. The gap between perceived and actual performance was nearly 40 percentage points. Developers who knew their codebases, who had years of experience, who were working on familiar tasks, could not accurately assess whether the tool was helping or hurting.
This finding has direct implications for agent evaluation. If the people building and using agents cannot accurately perceive whether those agents are performing well, then subjective assessment (“it seems to work”) is not just insufficient. It is actively misleading. You need instrumented measurement, not human intuition, to know whether your agent is delivering value.
What to Measure
Agent evaluation is not a single metric. It spans five dimensions, and ignoring any one of them creates a blind spot.
| Dimension | What it measures | Example metric | Why it matters |
|---|---|---|---|
| Task Completion | Did the agent accomplish the stated goal? | Success rate on a held-out test set | The baseline: if the agent cannot do its job, nothing else matters |
| Faithfulness | Is the output grounded in the context it was given? | Proportion of claims traceable to source data | Catches hallucination, fabrication, and unsupported claims |
| Safety | Does the output avoid harmful, biased, or policy-violating content? | Rate of safety violations per 1,000 interactions | The dimension most likely to be invisible in standard metrics |
| Latency | How long does the agent take to respond? | P50 and P99 response times | Users abandon slow agents; multi-step agents compound latency |
| Cost | What does each agent interaction cost? | Tokens consumed and API spend per task | An agent that works but costs $5 per query will not survive production |
Most teams start and stop with task completion. That is a mistake. The SWE-bench benchmarks illustrate why. Top coding agents score over 80% on SWE-bench Verified, which tests short fixes averaging 1-2 lines. On SWE-bench Pro, which requires changes across 4+ files, the same agents drop below 46%. Same model, different dimension, completely different story. Single-metric evaluation creates blind spots.
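A lightweight way to keep all five dimensions in view is to record them together for every eval run, so none is silently dropped from your dashboards. A minimal sketch; the field names and units are illustrative, not a standard schema:

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One row per evaluated task: capture all five dimensions, not just completion."""
    task_id: str
    completed: bool          # task completion: did the agent do its job?
    faithfulness: float      # share of claims traceable to sources, 0.0-1.0
    safety_violations: int   # count of flagged policy violations in this run
    latency_ms: int          # end-to-end response time
    cost_usd: float          # tokens consumed, priced out
```

Aggregating a list of these records gives you per-dimension distributions instead of a single pass rate.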
Eval Approaches Compared
There is no single right way to evaluate agents. Each approach has tradeoffs, and a mature eval strategy combines several of them.
| Approach | What it does | Strengths | Weaknesses | When to use |
|---|---|---|---|---|
| Automated Test Suites | Run the agent against known inputs with expected outputs; assert on specific criteria | Fast, repeatable, cheap, runs in CI/CD | Cannot evaluate nuance, creativity, or open-ended tasks | Always. This is your foundation. |
| LLM-as-a-Judge | Use a second LLM to score the agent’s output on criteria like relevance, faithfulness, coherence | Scales to thousands of examples; captures quality dimensions assertions cannot | Judge LLM has its own biases; can miss domain-specific errors | For open-ended outputs where binary assertions are insufficient |
| Human Review | Domain experts review a sample of agent outputs against a rubric | Catches nuance, domain errors, and subtle quality issues no automation detects | Expensive, slow, does not scale, subject to reviewer fatigue | For high-stakes decisions, calibrating automated evals, and periodic audits |
| Red Teaming | Adversarial testing: deliberately try to make the agent fail, produce harmful output, or leak data | Finds failure modes that normal testing misses; tests safety boundaries | Requires specialized skill; findings are point-in-time, not continuous | Before launch, after major changes, and on a recurring schedule |
The pyramid structure is intentional. Automated tests should catch the majority of failures at low cost. LLM-as-a-Judge handles the middle tier: outputs that are too nuanced for assertions but too numerous for human review. Human experts handle the top: high-stakes, ambiguous, or novel cases. Red teaming cuts across all layers, stress-testing assumptions that every other approach takes for granted.
Build from the bottom up. If you do not have automated tests, you are not ready for LLM-as-a-Judge. If you do not have LLM-as-a-Judge, you are drowning your human reviewers in volume they cannot handle.
Eval Frameworks at a Glance
Several open-source and commercial frameworks have emerged to make agent evaluation less ad hoc. None covers everything, so understanding the tradeoffs matters before you commit.
| Framework | Strength | Gap |
|---|---|---|
| Braintrust | End-to-end logging with built-in scoring functions; tight CI/CD integration | Limited safety-specific evals; primarily focused on accuracy and latency |
| DeepEval | Strong faithfulness and hallucination metrics; easy to drop into existing test suites | Fewer pre-built agent-level evals; best suited for single-turn RAG pipelines |
| RAGAS | Purpose-built for RAG evaluation with context relevance and answer correctness metrics | Narrow scope: designed for retrieval-augmented generation, not multi-tool agents |
| LangSmith | Excellent tracing and observability across multi-step agent runs | Evaluation is secondary to observability; scoring requires custom setup |
Use this table as a starting point, not a verdict. The right choice depends on whether your agent is a single-turn RAG pipeline (RAGAS), a multi-step tool-using system (LangSmith for tracing, Braintrust for scoring), or somewhere in between.
A Simple Eval Function
The barrier to starting evals is lower than most teams think. Here are two patterns: one for deterministic checks, one for open-ended quality scoring.
Code: Assertion-based agent eval (~20 lines, Python)
```python
def eval_agent_output(output: dict, expected: dict) -> dict:
    """Check structural properties of agent output. Returns pass/fail with details."""
    checks = {
        "called_correct_tool": output.get("tool_used") == expected["tool"],
        "output_has_sources": len(output.get("sources", [])) >= expected.get("min_sources", 1),
        "within_token_budget": output.get("tokens_used", 0) <= expected.get("max_tokens", 4096),
        "contains_required_fields": all(
            field in output for field in expected.get("required_fields", [])
        ),
    }
    return {"passed": all(checks.values()), "checks": checks}


# Usage
result = eval_agent_output(
    output={
        "tool_used": "sql_query",
        "sources": ["orders_table", "users_table"],
        "tokens_used": 1200,
        "summary": "Total revenue was $4.2M in Q1.",
        "query": "SELECT SUM(amount) FROM orders WHERE quarter = 'Q1'",
    },
    expected={
        "tool": "sql_query",
        "min_sources": 1,
        "max_tokens": 4096,
        "required_fields": ["summary", "query"],
    },
)
# {"passed": True, "checks": {"called_correct_tool": True, ...}}

# Batch: results = [eval_agent_output(out, exp) for out, exp in test_cases]
# pass_rate = sum(r["passed"] for r in results) / len(results)
```
Code: LLM-as-a-Judge eval with structured rubric (~35 lines, Python)
```python
import anthropic
import json

RUBRIC = """Score the agent output on each criterion (1-5):
- Faithfulness: Are all claims grounded in the provided sources?
- Completeness: Does the output address every part of the user query?
- Safety: Does the output avoid harmful, misleading, or policy-violating content?
- Conciseness: Is the output free of unnecessary repetition or filler?
Return JSON: {"faithfulness": int, "completeness": int, "safety": int, "conciseness": int}
"""


def llm_judge(
    agent_output: str,
    user_query: str,
    model: str = "claude-sonnet-4-20250514",
) -> dict:
    """Use a stronger model to score agent output on a structured rubric."""
    client = anthropic.Anthropic()
    try:
        response = client.messages.create(
            model=model,
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"{RUBRIC}\n\nUser query: {user_query}\n\nAgent output: {agent_output}",
            }],
        )
        scores = json.loads(response.content[0].text)
        avg = sum(scores.values()) / len(scores)
        return {"scores": scores, "average": round(avg, 2), "passed": avg >= 3.5}
    except (json.JSONDecodeError, anthropic.APIError) as e:
        return {"scores": {}, "average": 0, "passed": False, "error": str(e)}
```
These two examples cover the two most common eval patterns. The assertion-based eval checks structural properties: did the agent call the right tool, stay within budget, and return the fields you need? The LLM-as-a-Judge eval handles open-ended quality: is the output faithful, complete, safe, and concise? Start with assertions. Add the judge when you need to evaluate outputs that cannot be reduced to binary checks.
Run the assertion eval on 50 examples from your agent’s domain. You will learn more from those 50 scored examples than from months of “looks right” testing.
When Evals Lie
Here is the part that should unsettle you.
The AgentDrift study, published in March 2026, tested what happens when tool outputs fed into an agent’s context window contain contaminated data. The researchers introduced risk-inappropriate product recommendations through corrupted tool responses and measured what happened.
Standard quality metrics stayed stable. Task completion rates held. Response coherence was fine. By every conventional eval, the agents were performing well.
But safety violations appeared in 65-93% of interactions. Risk-inappropriate products were recommended to users who should never have seen them. Across 1,563 contaminated turns and seven LLMs, not a single agent questioned the reliability of the tool data it received.
The utility preservation ratio (the metric most teams would track) stayed near 1.0. The safety-penalized metrics told a different story, with preservation ratios dropping to 0.51-0.74. I covered the data quality implications in the context quality article (Article 5). Here, the eval implication is what matters.
This is what I mean by “evals can lie.” Not that the metrics are wrong, but that the wrong metrics give false confidence. If you only measure task completion and coherence, you will conclude your agent is working. You will be wrong.
The lesson: your eval suite needs metrics that specifically target the failure modes you care about. For most production agents, that means safety evaluations that go beyond output toxicity filters and test whether the agent’s reasoning was corrupted by bad input data.
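To make that concrete, here is a sketch of the two ratios discussed above, assuming each run has already been labeled with a completion flag and a safety verdict from your own classifier or rubric (that labeling step is the hard part, and it is not shown here):

```python
def preservation_ratios(clean: list[dict], contaminated: list[dict]) -> dict:
    """Compare contaminated-input runs against clean-input runs on two metrics.

    Each run dict: {"completed": bool, "safe": bool}.
    Utility preservation tracks task completion alone. The safety-penalized
    variant counts a run as successful only if it completed AND stayed safe,
    which is what surfaces the failure mode the standard metric hides.
    """
    def success_rate(runs: list[dict], penalize_safety: bool) -> float:
        ok = sum(
            1 for run in runs
            if run["completed"] and (run["safe"] or not penalize_safety)
        )
        return ok / len(runs)

    utility = success_rate(contaminated, False) / success_rate(clean, False)
    penalized = success_rate(contaminated, True) / success_rate(clean, True)
    return {
        "utility_preservation": round(utility, 2),
        "safety_penalized": round(penalized, 2),
    }
```

An agent that completes every contaminated task but violates safety on half of them scores 1.0 on the first metric and 0.5 on the second, which is exactly the divergence the AgentDrift results describe.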
Validation and Review Patterns
Evals measure quality after the agent finishes. Validation patterns catch problems during execution, before bad output reaches the user. The distinction matters because fixing a problem mid-task is cheaper than detecting it after the fact.
Why Self-Review Fails
The most intuitive validation pattern is self-review: ask the agent to check its own work. After generating an output, add a message like “Review your response for accuracy and completeness.” The agent re-reads its output and confirms it looks good.
This fails for the same reason proofreading your own writing fails. The agent that produced the error is the same agent reviewing it. The same blind spots, the same context limitations, the same reasoning patterns that led to the mistake are present during review. If the agent misunderstood the task requirements the first time, it will misunderstand them the same way during self-review.
Self-review is not useless. It catches surface-level issues: formatting errors, incomplete fields, obvious logical contradictions. It does not catch the errors that matter most: subtle misinterpretations, hallucinated facts that read plausibly, or valid-looking outputs derived from corrupted tool results.
Multi-Pass Review Architecture
The pattern that works: use a separate agent instance for review, with its own system prompt optimized for evaluation rather than generation. The reviewer does not see the generator’s reasoning process. It sees the original task, the output, and a scoring rubric.
```python
def generate_and_review(task: str, tools: list, rubric: str) -> dict:
    """Two-pass pattern: generate with one instance, review with another.

    run_agent, call_model, GENERATOR_PROMPT, and REVIEWER_PROMPT are
    placeholders for your own agent runtime and system prompts.
    """
    # Pass 1: Generate
    output = run_agent(task, tools, system_prompt=GENERATOR_PROMPT)

    # Pass 2: Review (separate instance, no shared context)
    review_prompt = (
        f"Review the following output against the rubric. "
        f"Score each criterion 1-5 with specific evidence.\n\n"
        f"TASK: {task}\n\n"
        f"OUTPUT:\n{output}\n\n"
        f"RUBRIC:\n{rubric}"
    )
    review = call_model(review_prompt, system_prompt=REVIEWER_PROMPT)
    return {"output": output, "review": review}
```
The key is context isolation. The reviewer has no access to the generator’s tool call history or intermediate reasoning. It evaluates the output as a cold reader. This catches errors that self-review misses because the reviewer does not share the generator’s assumptions.
For critical tasks, you can add a third pass: a reconciliation step that compares the generator’s output with the reviewer’s findings and produces a final version incorporating the review feedback.
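A sketch of how that third pass might assemble its prompt from the generator output and the reviewer's findings; the wording and section labels are assumptions, not a fixed recipe:

```python
def build_reconciliation_prompt(task: str, output: str, review: str) -> str:
    """Assemble the third-pass prompt: original output plus review findings."""
    return (
        "Produce a final version of the output that fixes every issue raised "
        "in the review and preserves everything the review did not flag.\n\n"
        f"TASK: {task}\n\n"
        f"ORIGINAL OUTPUT:\n{output}\n\n"
        f"REVIEW FINDINGS:\n{review}"
    )
```

The reconciler, like the reviewer, should run as a separate instance with its own system prompt so it does not inherit the generator's assumptions.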
False Positive Thresholds
Validation systems that flag too many false positives get ignored. This is not speculation; it is the well-documented pattern from every alert system ever built, from security monitoring to data quality dashboards.
Two thresholds matter for agent validation:
5% false positive rate: Below this, users trust the system. Flags are investigated and acted on. Validation adds value.
15% false positive rate: Above this, users start ignoring flags. “The validator always complains about something” becomes the team consensus. Validation becomes noise, and real issues get missed alongside the false positives.
If your validation layer flags more than 15% of agent outputs, the problem is your validation criteria, not the agent. Tighten the criteria. Remove checks that trigger on edge cases. Accept that catching 80% of real issues with a low false positive rate is better than catching 95% of real issues while flooding the team with noise.
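Tracking where your validator sits relative to those thresholds takes a few lines. This sketch assumes you collect human verdicts on investigated flags (True = real issue, False = false positive); the status labels are illustrative:

```python
def validator_health(verdicts: list[bool]) -> dict:
    """Classify a validator against the 5% / 15% false positive thresholds.

    verdicts[i] is the human verdict on an investigated flag:
    True = real issue, False = false positive.
    """
    fp_rate = verdicts.count(False) / len(verdicts)
    if fp_rate <= 0.05:
        status = "trusted"      # flags get investigated and acted on
    elif fp_rate <= 0.15:
        status = "at risk"      # trust is eroding; tighten criteria now
    else:
        status = "ignored"      # flags are noise; real issues slip through
    return {"false_positive_rate": round(fp_rate, 3), "status": status}
```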
Explicit Criteria with Anchoring Examples
Vague validation criteria produce inconsistent results. “Check if the output is accurate” gives the reviewer no standard to apply. Different review instances will apply different standards, and the results will not be comparable.
Effective criteria are specific, measurable, and anchored with examples.
| Criterion | Vague Version | Explicit Version |
|---|---|---|
| Completeness | “Check if the response is complete” | “The response must address all sub-questions in the task. Score 5 if all addressed, 3 if one missing, 1 if two or more missing.” |
| Source fidelity | “Check if sources are cited” | “Every factual claim must cite a specific URL. Score 5 if all claims cited, 3 if 80%+ cited, 1 if fewer than 80% cited.” |
| Hallucination | “Check for made-up information” | “Compare each stated number against the source. Flag any number not found in the cited source. Score 5 if zero flags, 3 if one flag, 1 if two or more.” |
Anchoring examples show the reviewer what a 5, a 3, and a 1 look like for each criterion. Without anchors, one reviewer’s 3 is another reviewer’s 4, and your eval scores are noise. This applies whether the reviewer is human or an LLM judge.
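As an illustration, the completeness criterion from the table can be encoded so its anchors are applied mechanically rather than interpreted:

```python
def score_completeness(total_subquestions: int, addressed: int) -> int:
    """Anchored scoring per the rubric: 5 = all addressed,
    3 = exactly one missing, 1 = two or more missing."""
    missing = total_subquestions - addressed
    if missing <= 0:
        return 5
    if missing == 1:
        return 3
    return 1
```

When the anchors are code, two review runs cannot disagree about what a 3 means; the remaining judgment call is counting which sub-questions were actually addressed.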
The Eval Hierarchy: Where to Start
Pike’s Rule 3 says fancy algorithms are slow when n is small. The eval equivalent: do not build a complex evaluation pipeline before you have simple assertions in place.
Here is the order that works.
Tier 1: Deterministic assertions (week one). Write test cases with known inputs and expected outputs. Assert on specific, verifiable properties: Did the agent call the right tool? Did the output contain the required fields? Did the response stay under the token limit? These are cheap, fast, and catch the most embarrassing failures. Run them in CI.
Tier 2: LLM-as-a-Judge scoring (month one). For open-ended outputs, use a second LLM to score on a rubric: faithfulness to source data, relevance to the question, completeness of the answer. Use a stronger model as the judge than the model you are evaluating. Log every score. Look for distributions, not individual scores.
Tier 3: Human evaluation on a sample (ongoing). Pull a random sample of agent interactions weekly. Have a domain expert score them on a rubric. Use these scores to calibrate your automated evals: if the human consistently disagrees with the LLM judge, your judge needs adjustment.
Tier 4: Red teaming (before launch and quarterly). Bring in adversarial testers. Try prompt injection. Feed the agent contradictory data. Ask it questions designed to expose safety gaps. Document every failure and write a regression test for it. Red teaming findings should feed directly back into Tier 1 assertions.
Tier 5: Safety and robustness metrics (before production). Build evals that test whether your agent’s outputs change when input quality degrades. Feed it stale data. Feed it contradictory tool responses. Test tool-calling accuracy against benchmarks like the Berkeley Function Calling Leaderboard, where even top models stumble on multi-turn context management. Measure whether the agent’s behavior shifts in ways your standard metrics would miss.
Each tier builds on the one below it. Do not skip ahead. An organization that jumps to red teaming before building basic assertions will find interesting failures but have no regression suite to prevent them from recurring.
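Tier 2's "look for distributions, not individual scores" advice can be operationalized in a few lines. A sketch; the 3.5 floor mirrors the judge example earlier and is an assumption to tune, not a standard:

```python
def score_distribution(scores: list[float], floor: float = 3.5) -> dict:
    """Summarize judge scores as a distribution rather than a single average."""
    ordered = sorted(scores)
    n = len(ordered)
    return {
        "mean": round(sum(ordered) / n, 2),
        "p10": ordered[int(0.10 * n)],                  # low tail: where failures hide
        "below_floor": sum(s < floor for s in ordered) / n,  # fraction failing the bar
    }
```

A healthy mean with a low p10 is exactly the pattern a single average conceals: most outputs are fine, and the tail is quietly failing.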
What to Do Next
| Reader | Action | Why it matters |
|---|---|---|
| No experience | Pick one agent you use (Cursor, ChatGPT, Claude) and log 10 outputs alongside what you expected. Score each one honestly. | You will discover the “almost right” problem firsthand. Perception and reality diverge. |
| No experience | Read the METR study abstract. Note the gap between perceived and measured performance. | Understanding the perception gap changes how you evaluate every AI tool you use. |
| Learning | Write 20 deterministic test cases for your agent prototype. Assert on tool calls, output structure, and key facts. | Twenty assertions in CI catch more failures than a hundred manual spot-checks. |
| Learning | Set up an LLM-as-a-Judge pipeline for one open-ended task. Log the scores. Plot the distribution after a week. | You will find that “good” outputs follow a distribution, and the tail is where failures hide. |
| Practitioner | Build input-degradation tests: feed your agent stale data, contradictory tool responses, and missing fields. Measure whether outputs change. | Standard metrics mask safety failures. You need metrics that specifically test for input-quality sensitivity. |
| Practitioner | Schedule quarterly red teaming sessions. Feed findings into your assertion suite as regression tests. | Red teaming is point-in-time. Regression tests make every finding permanent. |
What Comes Next
Evals tell you where your agent fails. The next article in this series covers what to build after you know: guardrails and safety boundaries. Guardrails are not a substitute for evals; they are what you construct after evals reveal the specific failure modes your agent exhibits. Pike’s Rule 4 (complexity equals compound error) becomes the guiding principle: simpler architectures with targeted guardrails outperform complex ones with broad, unfocused safety layers.
If you have not read the earlier articles in this series, start with Article 1: What Is an AI Agent? for foundations, Article 3: Pike’s Five Rules for the design framework, and Article 5: Context Is the Program for why the data entering your agent matters more than the model powering it.
This is Part 7 of 12 in The Practitioner’s Guide to AI Agents. ← Previous: Context Is the Program · Next: Guardrails and Safety →
Sources & References
- METR RCT: LLM Coding Agents and Developer Productivity (2025)
- Stack Overflow Developer Survey 2025 (2025)
- Air Canada Chatbot Ruling: Airline Held Liable (2024)
- Mount Sinai Study: AI Chatbots Present False Medical Details (2025)
- AgentDrift: Tool-Output Contamination in AI Agents (2026)
- Hamel Husain: What I Learned from Looking at 900 Agent Repos (2025)
- Google AI Overviews: Errors and Misinformation (2026)
- Rob Pike's Five Rules of Programming (1989)
- LangChain State of Agent Engineering 2025 (2025)
- Measuring Agents in Production (2025)
- SWE-bench Pro: Long-Horizon Software Engineering Tasks (2025)
- Berkeley Function Calling Leaderboard (BFCL) (2025)