Build a Real Agent This Weekend: From Zero to a Working Research Assistant
The series has defined agents, established design principles, and mapped failure modes. This article builds one: a complete research assistant agent with three tools, structured error handling (error categories plus retry logic), context management, and a basic eval, all in one runnable Python file using the Anthropic SDK.
Part 4 of 12: The Practitioner’s Guide to AI Agents
The Gap Between Reading and Building
I read every agent tutorial published in 2025. I studied the Anthropic SDK docs. I bookmarked Simon Willison’s engineering patterns and Karpathy’s AutoResearch repo. I understood, conceptually, how agents worked: LLM reasons, calls a tool, observes the result, loops.
Then I sat down to build one and stared at an empty file for twenty minutes.
The tutorials showed the happy path. Call the API, get a tool response, return the answer. None of them showed what happens when the tool times out. Or when the model returns a malformed response. Or when the context window fills up mid-task. Or when the agent loops fifteen times and never decides to stop. The gap between “I understand agents” and “I can build a reliable one” was wider than any tutorial acknowledged.
This article closes that gap. We are going to build a research assistant agent: three tools, full error handling, context management, loop termination, and a basic eval. One Python file, a few hundred lines, runnable with nothing but an Anthropic API key.
By the end, you will have built something real. Not a weather lookup toy. A system that takes a research question, searches for sources, reads pages, takes notes, and synthesizes a cited brief.
What We Are Building
The research assistant takes a question from the user and produces a sourced research brief. Here is the architecture.
The agent has three tools:
- web_search: Takes a query, returns a list of search results with titles, URLs, and snippets.
- read_page: Takes a URL, returns the page content (text extracted from HTML).
- save_note: Takes a note with source attribution and saves it to the agent’s working memory.
The agent loop runs until one of two conditions is met: the agent has gathered enough sources and decides to synthesize, or the iteration limit (10) is reached. Every tool call is wrapped in error handling. Token usage is tracked after each iteration.
The Tools
Each tool needs a name, a description, and an input schema. The description matters more than you might expect. It is the only thing the LLM reads when deciding which tool to call and how to call it. A vague description produces vague tool usage.
View code: tool definitions
TOOLS = [
    {
        "name": "web_search",
        "description": (
            "Search the web for information on a topic. Returns a list of results, "
            "each with a title, URL, and snippet. Use this to find relevant sources "
            "for the research question. Prefer specific queries over broad ones."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query. Be specific and targeted.",
                }
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_page",
        "description": (
            "Read the content of a web page given its URL. Returns the extracted "
            "text content. Use this after web_search to read promising results in "
            "full. Do not read more than 3 pages per research question."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The full URL of the page to read.",
                }
            },
            "required": ["url"],
        },
    },
    {
        "name": "save_note",
        "description": (
            "Save a research note with source attribution. Use this to record key "
            "findings as you research. Each note should contain a single claim or "
            "finding, the source URL, and a brief explanation. Save notes as you go; "
            "do not wait until the end."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "claim": {
                    "type": "string",
                    "description": "The key finding or claim.",
                },
                "source_url": {
                    "type": "string",
                    "description": "The URL where this information was found.",
                },
                "explanation": {
                    "type": "string",
                    "description": "Brief context explaining the claim.",
                },
            },
            "required": ["claim", "source_url", "explanation"],
        },
    },
]
Three details worth noting. First, the web_search description tells the model to prefer specific queries. Without this guidance, models tend to search for the entire research question verbatim, which produces poor results. Second, the read_page description caps reading at three pages. This is a soft constraint the model respects most of the time, and it prevents the agent from burning through its iteration budget reading every search result. Third, save_note instructs the model to save notes incrementally. Without this, the model tends to hold everything in its reasoning and attempt a single massive synthesis at the end, which is fragile.
The Tool Implementations
Since web search and page reading require external APIs, we use mock implementations that return realistic data. The code is designed so you can swap in real implementations with minimal changes.
View code: tool implementations (mocks with swap instructions)
import json
import time

# Research notes accumulate here across iterations
research_notes: list[dict] = []


def web_search(query: str, timeout: int = 10) -> str:
    """Mock web search. Returns realistic results for any query.

    To use a real search API, replace this function body with:

        import httpx
        resp = httpx.get(
            "https://api.tavily.com/search",
            params={"query": query, "api_key": os.environ["TAVILY_API_KEY"]},
            timeout=timeout,
        )
        return resp.text
    """
    results = [
        {
            "title": f"Research findings on: {query}",
            "url": f"https://example.com/research/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"A comprehensive analysis of {query}. Key findings include "
                "measurable improvements in efficiency and documented trade-offs "
                "in implementation complexity."
            ),
        },
        {
            "title": f"Industry report: {query}",
            "url": f"https://example.com/report/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"2025 industry data on {query}. Survey of 500 practitioners "
                "reveals adoption rates of 34% in enterprise settings, with "
                "significant variation by sector."
            ),
        },
        {
            "title": f"Case study: {query} in practice",
            "url": f"https://example.com/case/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"Real-world implementation of {query} at a Fortune 500 company. "
                "Reduced manual effort by 60% over 18 months. Lessons learned "
                "and failure modes documented."
            ),
        },
    ]
    return json.dumps(results)


def read_page(url: str, timeout: int = 15) -> str:
    """Mock page reader. Returns realistic page content.

    To use a real page reader, replace this function body with:

        import httpx
        from bs4 import BeautifulSoup
        resp = httpx.get(url, timeout=timeout, follow_redirects=True)
        soup = BeautifulSoup(resp.text, "html.parser")
        return soup.get_text(separator="\\n", strip=True)[:5000]
    """
    return (
        f"Content from {url}\n\n"
        "Key findings from this source:\n"
        "1. Organizations that implemented structured approaches saw 40-60% "
        "improvement in target metrics over 12-18 months.\n"
        "2. The most common failure mode was insufficient stakeholder alignment, "
        "not technical complexity.\n"
        "3. Teams that measured outcomes from day one iterated 3x faster than "
        "teams that deferred measurement.\n"
        "4. Cost of implementation ranged from $50K-$500K depending on scope, "
        "with median ROI positive within 9 months.\n\n"
        "The study surveyed 200 organizations across financial services, "
        "healthcare, and technology sectors between 2023 and 2025."
    )


def save_note(claim: str, source_url: str, explanation: str) -> str:
    """Save a research note with source attribution."""
    note = {
        "claim": claim,
        "source_url": source_url,
        "explanation": explanation,
        "saved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    research_notes.append(note)
    return json.dumps({"status": "saved", "total_notes": len(research_notes)})
The mock functions return enough realistic data for the agent to reason over meaningfully. The comments show exactly what to replace for a production implementation: Tavily for search, httpx plus BeautifulSoup for page reading. The save_note function is already production-ready; it just appends to an in-memory list.
The Agent Loop
This is the core of the agent. The loop calls Claude, dispatches tool calls, handles errors, tracks token usage, and enforces the iteration limit.
View code: the complete agent (one runnable file)
#!/usr/bin/env python3
"""Research assistant agent. Requires: pip install anthropic"""
import anthropic
import json
import sys
import time

# --- Configuration ---
MODEL = "claude-sonnet-4-20250514"
MAX_TOKENS = 4096
MAX_ITERATIONS = 10
TOKEN_WARNING_THRESHOLD = 80_000  # warn when input tokens approach this
TOOL_TIMEOUT = 15  # seconds

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

# --- Tool definitions (TOOLS list from above) ---
TOOLS = [
    {
        "name": "web_search",
        "description": (
            "Search the web for information on a topic. Returns a list of results, "
            "each with a title, URL, and snippet. Use this to find relevant sources "
            "for the research question. Prefer specific queries over broad ones."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query. Be specific and targeted.",
                }
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_page",
        "description": (
            "Read the content of a web page given its URL. Returns the extracted "
            "text content. Use this after web_search to read promising results in "
            "full. Do not read more than 3 pages per research question."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The full URL of the page to read.",
                }
            },
            "required": ["url"],
        },
    },
    {
        "name": "save_note",
        "description": (
            "Save a research note with source attribution. Use this to record key "
            "findings as you research. Each note should contain a single claim or "
            "finding, the source URL, and a brief explanation. Save notes as you go; "
            "do not wait until the end."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "claim": {
                    "type": "string",
                    "description": "The key finding or claim.",
                },
                "source_url": {
                    "type": "string",
                    "description": "The URL where this information was found.",
                },
                "explanation": {
                    "type": "string",
                    "description": "Brief context explaining the claim.",
                },
            },
            "required": ["claim", "source_url", "explanation"],
        },
    },
]

# --- Tool implementations ---
research_notes: list[dict] = []


def web_search(query: str, timeout: int = TOOL_TIMEOUT) -> str:
    """Mock web search. Replace body with Tavily API call for production."""
    results = [
        {
            "title": f"Research findings on: {query}",
            "url": f"https://example.com/research/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"A comprehensive analysis of {query}. Key findings include "
                "measurable improvements in efficiency and documented trade-offs."
            ),
        },
        {
            "title": f"Industry report: {query}",
            "url": f"https://example.com/report/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"2025 industry data on {query}. Survey of 500 practitioners "
                "reveals adoption rates of 34% in enterprise settings."
            ),
        },
        {
            "title": f"Case study: {query} in practice",
            "url": f"https://example.com/case/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"Real-world implementation of {query} at a Fortune 500 company. "
                "Reduced manual effort by 60% over 18 months."
            ),
        },
    ]
    return json.dumps(results)


def read_page(url: str, timeout: int = TOOL_TIMEOUT) -> str:
    """Mock page reader. Replace body with httpx + BeautifulSoup for production."""
    return (
        f"Content from {url}\n\n"
        "Key findings:\n"
        "1. Structured approaches yielded 40-60% improvement over 12-18 months.\n"
        "2. Most common failure mode: insufficient stakeholder alignment.\n"
        "3. Teams measuring from day one iterated 3x faster.\n"
        "4. Implementation cost: $50K-$500K, median ROI positive within 9 months.\n\n"
        "Survey of 200 organizations across financial services, healthcare, "
        "and technology sectors (2023-2025)."
    )


def save_note(claim: str, source_url: str, explanation: str) -> str:
    """Save a research note with source attribution."""
    note = {
        "claim": claim,
        "source_url": source_url,
        "explanation": explanation,
        "saved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    research_notes.append(note)
    return json.dumps({"status": "saved", "total_notes": len(research_notes)})


# Map tool names to functions
TOOL_DISPATCH = {
    "web_search": lambda args: web_search(args["query"]),
    "read_page": lambda args: read_page(args["url"]),
    "save_note": lambda args: save_note(
        args["claim"], args["source_url"], args["explanation"]
    ),
}


def execute_tool(name: str, args: dict) -> str:
    """Execute a tool call with error handling and timeout tracking."""
    if name not in TOOL_DISPATCH:
        return json.dumps({"error": f"Unknown tool: {name}"})
    try:
        start = time.time()
        result = TOOL_DISPATCH[name](args)
        elapsed = time.time() - start
        if elapsed > TOOL_TIMEOUT:
            return json.dumps({
                "error": f"Tool {name} took {elapsed:.1f}s (limit: {TOOL_TIMEOUT}s)",
                "partial_result": result[:500] if result else None,
            })
        return result
    except Exception as e:
        return json.dumps({"error": f"Tool {name} failed: {str(e)}"})


def run_agent(question: str) -> str:
    """Run the research assistant agent on a question."""
    system_prompt = (
        "You are a research assistant. Your job is to answer the user's research "
        "question by searching the web, reading relevant pages, and taking notes "
        "with source attribution.\n\n"
        "Process:\n"
        "1. Search for relevant sources using web_search.\n"
        "2. Read the most promising pages using read_page (max 3 pages).\n"
        "3. Save key findings using save_note as you discover them.\n"
        "4. Once you have enough information (at least 3 notes from 2+ sources), "
        "synthesize a brief with citations.\n\n"
        "Rules:\n"
        "- Every claim in your final brief must cite a source URL.\n"
        "- If search results are insufficient, try a different query.\n"
        "- If a page fails to load, skip it and try the next result.\n"
        "- Be concise. The brief should be 200-400 words."
    )
    messages = [{"role": "user", "content": question}]
    total_input_tokens = 0
    total_output_tokens = 0

    for iteration in range(1, MAX_ITERATIONS + 1):
        print(f"\n--- Iteration {iteration}/{MAX_ITERATIONS} ---")

        # Call the LLM
        try:
            response = client.messages.create(
                model=MODEL,
                max_tokens=MAX_TOKENS,
                system=system_prompt,
                tools=TOOLS,
                messages=messages,
            )
        except anthropic.APIError as e:
            print(f"  API error: {e}")
            break

        # Track token usage
        total_input_tokens += response.usage.input_tokens
        total_output_tokens += response.usage.output_tokens
        print(f"  Tokens: {response.usage.input_tokens} in, "
              f"{response.usage.output_tokens} out "
              f"(cumulative: {total_input_tokens} in, {total_output_tokens} out)")

        # Warn if approaching context limit
        if total_input_tokens > TOKEN_WARNING_THRESHOLD:
            print(f"  WARNING: Input tokens ({total_input_tokens}) approaching "
                  f"limit ({TOKEN_WARNING_THRESHOLD}). Agent should wrap up.")

        # Check stop reason
        if response.stop_reason == "end_turn":
            # Agent is done; extract final text
            final_text = ""
            for block in response.content:
                if hasattr(block, "text"):
                    final_text += block.text
            print(f"\n  Agent finished after {iteration} iterations.")
            print(f"  Total tokens: {total_input_tokens} in, "
                  f"{total_output_tokens} out")
            print(f"  Notes saved: {len(research_notes)}")
            return final_text

        if response.stop_reason != "tool_use":
            # Unexpected stop reason
            print(f"  Unexpected stop_reason: {response.stop_reason}")
            final_text = ""
            for block in response.content:
                if hasattr(block, "text"):
                    final_text += block.text
            return final_text or "Agent stopped unexpectedly."

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"  Tool call: {block.name}({json.dumps(block.input)[:80]}...)")
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        # Append assistant message and tool results to conversation
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    # If we hit max iterations, ask for a final synthesis
    print(f"\n  Max iterations ({MAX_ITERATIONS}) reached. Requesting synthesis.")
    messages.append({
        "role": "user",
        "content": (
            "You have reached the maximum number of research iterations. "
            "Synthesize your findings into a brief now, using the notes you "
            "have saved. Cite sources for every claim."
        ),
    })
    try:
        response = client.messages.create(
            model=MODEL, max_tokens=MAX_TOKENS,
            system=system_prompt, tools=TOOLS, messages=messages,
        )
        for block in response.content:
            if hasattr(block, "text"):
                return block.text
    except anthropic.APIError as e:
        return f"Final synthesis failed: {e}"
    return "Agent completed without producing a final brief."


# --- Entry point ---
if __name__ == "__main__":
    question = (
        sys.argv[1] if len(sys.argv) > 1
        else "What are the most effective approaches to implementing "
             "data quality monitoring in enterprise data platforms?"
    )
    print(f"Research question: {question}\n")
    brief = run_agent(question)
    print("\n" + "=" * 60)
    print("RESEARCH BRIEF")
    print("=" * 60)
    print(brief)
That is the entire agent in one file. Let me walk through the pieces that matter.
The Loop Explained
The core is a for loop bounded by MAX_ITERATIONS. This is not a while True. I used a bounded loop deliberately. An unbounded loop with a break condition inside is the most common source of runaway agents. The bounded loop guarantees termination.
for iteration in range(1, MAX_ITERATIONS + 1):
    response = client.messages.create(...)
    if response.stop_reason == "end_turn":
        return final_text  # Agent decided it's done
    if response.stop_reason != "tool_use":
        return "Agent stopped unexpectedly."  # Defensive
    # Process tool calls and continue
Three things happen on every iteration. The LLM is called with the full conversation history. The stop reason is checked: end_turn means the agent wants to return a final answer, tool_use means it wants to call a tool. Anything else is treated as an error. This three-way check handles the happy path, the completion path, and the defensive path in six lines.
Error Handling: The Part Tutorials Skip
The execute_tool function wraps every tool call in a try/except. If a tool crashes, the agent receives an error message and keeps running. This matters because real tool calls fail regularly. APIs time out. Pages return 403. JSON parsing breaks on unexpected HTML.
def execute_tool(name: str, args: dict) -> str:
    if name not in TOOL_DISPATCH:
        return json.dumps({"error": f"Unknown tool: {name}"})
    try:
        result = TOOL_DISPATCH[name](args)
        return result
    except Exception as e:
        return json.dumps({"error": f"Tool {name} failed: {str(e)}"})
When a tool returns an error, the LLM receives that error as a tool result. Claude is good at reading error messages and adapting. If read_page fails on a URL, the agent typically tries the next search result. If web_search fails, it reformulates the query. This graceful degradation is not automatic; it comes from the system prompt instruction: “If a page fails to load, skip it and try the next result.”
The execute_tool function also checks for unknown tool names. This handles the case where the model hallucinates a tool that does not exist. Without this check, you get a KeyError that crashes the entire agent.
Beyond Try/Except: Structured Error Responses
The basic execute_tool function catches errors and returns a generic JSON string: {"error": "Tool read_page failed: ConnectionTimeout"}. That is enough to keep the agent running, but it is not enough for the agent to make a good decision about what to do next. A connection timeout and an invalid URL are both errors, but they require opposite responses: retry the first, abandon the second.
Production agents need structured error responses. Every tool error should carry three pieces of information.
from dataclasses import dataclass

@dataclass
class ToolError:
    message: str        # Human-readable description
    errorCategory: str  # transient | validation | business | permission | not_found
    isRetryable: bool   # Can the LLM retry this exact call?
    suggestion: str | None = None  # What the model should do next
The five error categories drive the agent’s recovery behavior:
| Category | When It Applies | Retryable? | Agent Should… |
|---|---|---|---|
| transient | Timeouts, rate limits, temporary outages | Yes | Retry after a brief pause |
| validation | Bad input format, missing required field | No | Ask the user to correct the input |
| business | Valid request, but a business rule prevents it | No | Explain the constraint to the user |
| permission | Caller lacks access to the resource | No | Explain the access limitation |
| not_found | The requested entity does not exist | No | Ask the user to verify the identifier |
Here is the upgraded execute_tool function:
def execute_tool(name: str, args: dict) -> tuple[str, bool]:
    """Execute a tool call. Returns (result_json, is_error)."""
    if name not in TOOL_DISPATCH:
        return json.dumps({
            "error": f"Unknown tool: {name}",
            "errorCategory": "validation",
            "isRetryable": False,
            "suggestion": f"Available tools: {', '.join(TOOL_DISPATCH.keys())}",
        }), True
    try:
        start = time.time()
        result = TOOL_DISPATCH[name](args)
        elapsed = time.time() - start
        if elapsed > TOOL_TIMEOUT:
            return json.dumps({
                "error": f"Tool {name} took {elapsed:.1f}s (limit: {TOOL_TIMEOUT}s)",
                "errorCategory": "transient",
                "isRetryable": True,
                "suggestion": "Retry with a simpler query or try a different source.",
            }), True
        return result, False
    except ConnectionError:
        return json.dumps({
            "error": f"Could not connect to {name} service.",
            "errorCategory": "transient",
            "isRetryable": True,
            "suggestion": "Retry after a brief pause.",
        }), True
    except ValueError as e:
        return json.dumps({
            "error": f"Invalid input: {str(e)}",
            "errorCategory": "validation",
            "isRetryable": False,
            "suggestion": "Check the input format and try with corrected values.",
        }), True
    except Exception as e:
        return json.dumps({
            "error": f"Tool {name} failed: {str(e)}",
            "errorCategory": "transient",
            "isRetryable": True,
        }), True
The is_error boolean returned alongside the result maps to the isError flag in tool results. When is_error is True, the model knows the result is an error, not data. Without this flag, the model treats error text as a successful result. It might say “Here are your results: connection timeout” and try to synthesize a brief from an error message.
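One way to wire this into the agent loop is a small helper that runs the tool and packages the outcome as a tool_result block, setting the result's error flag when the tuple says so. This is a sketch, not the article's loop verbatim; the helper name `build_tool_result` is illustrative, and it takes the executor as a parameter so it works with any tuple-returning `execute_tool`:

```python
import json


def build_tool_result(tool_use_id: str, name: str, args: dict, execute_tool) -> dict:
    """Run a tool and package the outcome as a tool_result block.

    Marks the block as an error when execute_tool reports one, so the
    model can distinguish failure text from successful data.
    """
    result, is_error = execute_tool(name, args)
    block = {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": result,
    }
    if is_error:
        block["is_error"] = True  # flags this result as a failure, not data
    return block
```

In the loop, each `tool_results.append({...})` call would be replaced by `tool_results.append(build_tool_result(block.id, block.name, block.input, execute_tool))`.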
“Checked and Found Nothing” vs. “Failed to Check”
The most important distinction in tool error handling is one that generic try/except misses entirely. Consider two scenarios for web_search("quantum computing applications in healthcare"):
- The search runs successfully but returns zero results.
- The search API is down and throws an exception.
Both produce “no results,” but they mean opposite things. Scenario 1 is a valid answer: there is nothing to find. Scenario 2 is a failure: we do not know whether there is something to find. The agent’s behavior should differ:
- Valid empty result (is_error=False): “My search found no results for that query. Let me try a different search term.”
- Access failure (is_error=True): “The search service is unavailable. I cannot complete this research right now.”
In the research agent, this distinction matters because the agent decides when it has “enough information” to synthesize. If it treats a failed search as “no results exist,” it might synthesize a brief claiming there is limited research on a topic when in fact it simply could not reach the search API.
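A minimal sketch of the distinction, using a hypothetical `safe_search` wrapper (the wrapper name is illustrative, not part of the article's agent code): an empty list comes back as a valid result, while an exception comes back as a flagged failure.

```python
import json


def safe_search(search_fn, query: str) -> tuple[str, bool]:
    """Run a search, separating 'found nothing' from 'could not check'.

    Returns (result_json, is_error). An empty result list is a valid
    answer (is_error=False); an exception is an access failure
    (is_error=True) that the agent should not mistake for absence.
    """
    try:
        results = search_fn(query)
    except Exception as e:
        # Failed to check: we do not know whether anything exists.
        return json.dumps({
            "error": f"Search service unavailable: {e}",
            "errorCategory": "transient",
            "isRetryable": True,
        }), True
    if not results:
        # Checked and found nothing: a valid, complete answer.
        return json.dumps({"results": [], "note": "no matches found"}), False
    return json.dumps({"results": results}), False
```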
Validation-Retry: Not Blind Retry
When an error is retryable, the agent should not resubmit the identical request. That is blind retry, and it fails for the same reason the original call failed. Effective retry means modifying the request based on the specific error.
The pattern has three components: the original request, the failed result, and the specific error that caused the failure. All three go into the next attempt.
# Bad: blind retry (same input, same failure)
result = web_search("quantum computing healthcare") # times out
result = web_search("quantum computing healthcare") # times out again
# Good: retry with error feedback (simplified query)
result = web_search("quantum computing healthcare") # times out
# Agent reads the error: transient, isRetryable, suggestion says "simpler query"
result = web_search("quantum computing medical") # narrower query, succeeds
In the research agent, this behavior emerges from the system prompt combined with structured error responses. When the agent receives a transient error with a suggestion, it adapts its next tool call. When it receives a validation error, it asks the user for corrected input instead of retrying. The error structure replaces guesswork with a decision framework.
The basic error handling from the previous section keeps the agent alive. Structured error responses make it intelligent about recovery.
Context Management
Every iteration tracks token usage via response.usage.input_tokens and response.usage.output_tokens. When input tokens exceed the warning threshold (80,000), the agent prints a warning. This does not force the agent to stop, but it signals that the conversation history is growing large.
In a production agent, you would go further. Anthropic’s token counting API lets you check token counts before sending a request. You could add a hard cutoff that triggers synthesis when the context reaches 90% of the model’s limit. You could also implement context compression: summarizing older tool results instead of keeping the full text.
For a weekend build, the warning threshold plus the iteration limit is sufficient. The iteration limit is your primary safety net. Ten iterations with three tool calls each is thirty tool calls, which is more than enough for a research task.
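One lightweight form of the context compression mentioned above is truncating tool results from older turns in place, since their key findings are usually already captured in saved notes. A sketch, assuming the messages-list shape the agent uses; the helper name and thresholds are illustrative:

```python
def compress_old_tool_results(messages: list[dict], keep_last: int = 2,
                              max_chars: int = 200) -> list[dict]:
    """Truncate tool_result content in all but the last `keep_last` tool turns.

    Older tool output has usually been distilled into saved notes, so
    shortening it frees context without losing the agent's findings.
    """
    # Indices of user messages that carry tool results (content is a list)
    tool_turns = [
        i for i, m in enumerate(messages)
        if m["role"] == "user" and isinstance(m["content"], list)
    ]
    old_turns = tool_turns[:-keep_last] if keep_last else tool_turns
    for i in old_turns:
        for block in messages[i]["content"]:
            if block.get("type") == "tool_result" and isinstance(block.get("content"), str):
                if len(block["content"]) > max_chars:
                    block["content"] = block["content"][:max_chars] + " ...[truncated]"
    return messages
```

Calling this before each `client.messages.create` call would cap how fast the conversation history grows, at the cost of losing detail the model might still want to revisit.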
Running It
Save the full code block above as research_agent.py and run:
pip install anthropic
export ANTHROPIC_API_KEY="your-key-here"
python research_agent.py "What are the best practices for AI governance in financial services?"
Here is representative output from a run:
Research question: What are the best practices for AI governance in financial services?
--- Iteration 1/10 ---
Tokens: 1247 in, 156 out (cumulative: 1247 in, 156 out)
Tool call: web_search({"query": "AI governance best practices financial services 2025"}...)
--- Iteration 2/10 ---
Tokens: 2103 in, 203 out (cumulative: 3350 in, 359 out)
Tool call: read_page({"url": "https://example.com/research/AI-governance-best-pract"}...)
--- Iteration 3/10 ---
Tokens: 3012 in, 189 out (cumulative: 6362 in, 548 out)
Tool call: save_note({"claim": "Structured AI governance approaches yielded 40-60%"}...)
Tool call: save_note({"claim": "Most common failure mode was insufficient stakeholde"}...)
--- Iteration 4/10 ---
Tokens: 3845 in, 167 out (cumulative: 10207 in, 715 out)
Tool call: read_page({"url": "https://example.com/report/AI-governance-best-pract"}...)
--- Iteration 5/10 ---
Tokens: 4521 in, 201 out (cumulative: 14728 in, 916 out)
Tool call: save_note({"claim": "34% adoption rate in enterprise settings for formal"}...)
--- Iteration 6/10 ---
Tokens: 5102 in, 312 out (cumulative: 19830 in, 1228 out)
Agent finished after 6 iterations.
Total tokens: 19830 in, 1228 out
Notes saved: 3
============================================================
RESEARCH BRIEF
============================================================
Based on research across industry reports and case studies, here are
the key findings on AI governance in financial services...
[Sources cited inline with URLs]
The agent searched once, read two pages, saved three notes, and synthesized a brief. Six iterations, under 20K input tokens. It decided on its own when it had enough information. That decision came from the system prompt instruction: “Once you have enough information (at least 3 notes from 2+ sources), synthesize a brief.”
What Can Go Wrong
Every failure mode in this agent has a corresponding defense. Here is the mapping.
| Failure Mode | What Happens | How the Code Handles It |
|---|---|---|
| Tool crash | read_page raises an exception (timeout, bad URL, DNS failure) | execute_tool catches the exception and returns a structured error with category and retry guidance. The LLM reads the category and adapts: retries transient failures, abandons validation errors. |
| Unknown tool | The model hallucinates a tool name like analyze_data | execute_tool checks TOOL_DISPATCH and returns a validation error with a list of available tools. The LLM picks a valid tool on the next iteration. |
| Empty result vs. failure | web_search returns zero results, or the search API is unreachable | Structured errors distinguish is_error=False (valid empty result: nothing found) from is_error=True (service failure: could not check). The agent adjusts its synthesis accordingly. |
| Runaway loop | The agent keeps searching without synthesizing | MAX_ITERATIONS = 10 caps the loop. If reached, the agent is forced to synthesize with whatever notes it has. |
| Context overflow | Conversation grows too long for the model | TOKEN_WARNING_THRESHOLD prints a warning. In production, add a hard cutoff that triggers immediate synthesis. |
| Malformed response | stop_reason is neither end_turn nor tool_use | The defensive branch returns whatever text the model produced, or a fallback message. |
| API failure | Anthropic API returns a 500, rate limit, or network error | anthropic.APIError catch breaks the loop and returns what the agent has so far. |
| Empty synthesis | The model finishes but returns no text | The final fallback: “Agent completed without producing a final brief.” |
The defense that matters most is the bounded loop. A while True loop with the wrong break condition can run indefinitely, burning tokens and money. The for loop with MAX_ITERATIONS makes termination a structural guarantee, not a behavioral hope.
Adding a Basic Eval
Building the agent is half the work. The other half is knowing whether the output is any good. Hamel Husain’s eval framework says it clearly: “Evals tell you whether the output is good.” Without them, you are judging by feel.
Here is a basic eval that checks three properties of the research brief.
View code: basic eval for the research agent
import re


def eval_research_brief(brief: str, notes: list[dict]) -> dict:
    """Score a research brief on three dimensions.

    Returns a dict with scores and a pass/fail verdict.
    """
    scores = {}

    # 1. Source count: how many unique URLs are cited in the brief?
    urls_in_brief = set(re.findall(r"https?://[^\s\)\"]+", brief))
    scores["sources_cited"] = len(urls_in_brief)
    scores["sources_pass"] = len(urls_in_brief) >= 2

    # 2. Note coverage: what fraction of saved notes appear in the brief?
    notes_referenced = 0
    for note in notes:
        # Check if the claim (or a substantial substring) appears in the brief
        claim_words = note["claim"].lower().split()
        # Match if at least 60% of claim words appear in the brief
        brief_lower = brief.lower()
        matches = sum(1 for w in claim_words if w in brief_lower)
        if matches / max(len(claim_words), 1) >= 0.6:
            notes_referenced += 1
    scores["notes_referenced"] = notes_referenced
    scores["notes_total"] = len(notes)
    scores["coverage_pass"] = (
        notes_referenced / max(len(notes), 1) >= 0.5
    )

    # 3. Length check: is the brief within the target range?
    word_count = len(brief.split())
    scores["word_count"] = word_count
    scores["length_pass"] = 100 <= word_count <= 600

    # Overall verdict
    scores["passed"] = all([
        scores["sources_pass"],
        scores["coverage_pass"],
        scores["length_pass"],
    ])
    return scores


# Run the eval after the agent completes
if __name__ == "__main__":
    # ... (after run_agent returns)
    eval_result = eval_research_brief(brief, research_notes)
    print("\n" + "=" * 60)
    print("EVAL RESULTS")
    print("=" * 60)
    for key, value in eval_result.items():
        print(f"  {key}: {value}")
    print(f"\n  Verdict: {'PASS' if eval_result['passed'] else 'FAIL'}")
The eval checks three things. First, does the brief cite at least two distinct source URLs? A brief without citations is not research; it is the model’s training data repackaged. Second, does the brief reference at least half of the notes the agent saved? If the agent saved four notes but the brief only uses one, information was lost in synthesis. Third, is the brief within the target word count? Too short means insufficient detail. Too long means the model is padding.
This eval is simple. It does not check factual accuracy, logical coherence, or writing quality. Article 6 in this series covers how to build serious evals with LLM-as-a-Judge scoring and regression test suites. But this basic eval catches the most common failure modes: missing citations, dropped findings, and incorrect length. You can run it on every agent invocation in under a second. Start here.
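To see how the citation and coverage heuristics behave before wiring them into an agent run, here is the core of each check in isolation, on a toy brief and a toy note (the example text and URLs are made up for illustration):

```python
import re

brief = (
    "Transformer models scale well with data. "
    "See https://example.com/a and https://example.com/b for details."
)
notes = [{"claim": "Transformer models scale well with data"}]

# Unique cited URLs, using the same regex as the eval
urls = set(re.findall(r"https?://[^\s\)\"]+", brief))

# 60% word-overlap coverage check, same heuristic as the eval
brief_lower = brief.lower()
referenced = 0
for note in notes:
    words = note["claim"].lower().split()
    matches = sum(1 for w in words if w in brief_lower)
    if matches / max(len(words), 1) >= 0.6:
        referenced += 1

print(len(urls), referenced)  # → 2 1
```

Note that the overlap check matches words anywhere in the brief, not as a contiguous phrase, so it tolerates paraphrase at the cost of occasional false positives.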
Connecting to the Series
This agent is a direct application of the principles established in the first three articles.
From Article 1 (What Is an AI Agent): The agent has all four components. The LLM (Claude) is the reasoning engine. The tools (web_search, read_page, save_note) are the hands. Memory accumulates in the messages list and the research_notes array. The loop iterates until the goal is met or the iteration limit is reached.
From Article 3 (Pike’s Rules): We applied Rule 3 directly: start simple. One agent, one system prompt, three tools. No orchestration framework. No multi-agent routing. No vector database. The code uses the raw Anthropic SDK because Anthropic’s own guidance recommends starting without a framework. If this agent proves insufficient for your use case, you will know exactly which part needs more sophistication, because you understand every line.
Looking ahead: Article 5 explains why the data entering the agent’s context window matters more than the model. Notice how the system prompt structures what the agent pays attention to and how it processes tool results. That is Context Engineering in practice. Article 6 shows how to build evaluation pipelines that go beyond the basic eval we wrote here. Article 7 adds the safety layer: what happens when tool results contain prompt injection, stale data, or contradictions.
The intent was not to build the best possible research agent. It was to build one that demonstrates every concept the series teaches, in working code, so the remaining articles have a concrete artifact to reference.
Do Next
| Priority | Action | Why it matters |
|---|---|---|
| No experience | Copy the full agent code, install the anthropic package, set your API key, and run it with the default question. Read the iteration output. Watch the agent decide which tool to call and when to stop. | Reading agent code is different from watching it run. The iteration log shows you the decision loop in real time: search, read, note, synthesize. That loop is the concept from Article 1 made concrete. |
| No experience | Change the research question to something you actually want to know about. Run it again. Compare the two briefs. | A research agent on a topic you care about surfaces the quality questions immediately: did it find good sources? Did it miss something obvious? Did the brief make sense? These reactions are the start of building evaluation instincts. |
| Learning | Replace the mock web_search with a real search API (Tavily has a free tier). Run the agent on three different questions and compare the output quality. | Mock data produces predictable results. Real data produces surprising failures: irrelevant search results, pages that cannot be parsed, sources that contradict each other. These failures teach you more about agent reliability than any tutorial. |
| Learning | Add a fourth tool: check_claim that takes a claim and a source URL and verifies whether the source actually supports the claim. Wire it into the system prompt. | This is reasoning-layer validation from Article 7, applied inside the agent itself. You are building a self-checking agent, which previews the self-improvement patterns in Article 8. |
| Practitioner | Adapt the agent loop and error handling patterns to your domain. Replace the three research tools with tools relevant to your work: a Data Catalog lookup, a metric query, a schema validator. Keep the loop structure, the error handling, and the eval. | The architecture is domain-agnostic. The tools are domain-specific. The loop, error handling, and eval patterns transfer directly to any agent you build. Swap the tools, keep the skeleton. |
| Practitioner | Run the basic eval on 20 different research questions. Track pass rates across the three dimensions (sources, coverage, length). Identify which dimension fails most often. | Twenty runs with tracked metrics is a minimal eval suite. You will discover patterns: the agent consistently cites only one source, or it drops notes during synthesis, or it overshoots the word count. Each pattern points to a specific prompt or tool description improvement. |
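For the check_claim exercise in the table above, a reasonable starting point is the tool definition plus a placeholder implementation. The schema below follows the Anthropic tool-use format; the name, description, and the crude lexical check are illustrative assumptions, not a verification algorithm:

```python
# Hypothetical tool schema in the Anthropic tool-use format.
# The description is what steers the agent to call it before saving a note.
CHECK_CLAIM_TOOL = {
    "name": "check_claim",
    "description": (
        "Verify whether a source supports a claim. Call this before "
        "saving a note whose claim you are unsure about."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "claim": {"type": "string", "description": "The claim to verify."},
            "source_url": {"type": "string", "description": "URL of the source."},
        },
        "required": ["claim", "source_url"],
    },
}

def check_claim(claim: str, page_text: str) -> dict:
    # Placeholder: does most of the claim's vocabulary appear in the
    # fetched page text? A real version would use a second model call.
    words = claim.lower().split()
    hits = sum(1 for w in words if w in page_text.lower())
    return {"supported": hits / max(len(words), 1) >= 0.6}

result = check_claim("the sky is blue", "Observations confirm the sky is blue.")
print(result)  # → {'supported': True}
```

In the real tool handler you would fetch source_url with read_page first, then pass the page text to the check; the lexical heuristic is only there so the loop has something to run before you swap in a model-based judge.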
This is Part 4 of 12 in The Practitioner’s Guide to AI Agents. ← Previous: Pike’s Five Rules for Agent Development · Next: Context Is the Program →
Sources & References
- Anthropic: Tool Use with Claude (2025)
- Anthropic: Building Effective Agents (2024)
- Anthropic Python SDK (2025)
- Tavily Search API (2025)
- Simon Willison: Agentic Engineering Patterns (2026)
- Hamel Husain: Evals Skills for Coding Agents (2026)
- Karpathy AutoResearch (2026)
- Anthropic: Message Batches and Token Counting (2025)