Build a Real Agent This Weekend: From Zero to a Working Research Assistant
The series has defined agents, established design principles, and mapped failure modes. This article builds one: a complete research assistant agent with three tools, structured error handling (error categories plus retry logic), context management, and a basic eval, all in one runnable Python file using the Anthropic SDK.
Part 4 of 12: The Practitioner’s Guide to AI Agents
The Gap Between Reading and Building
I read every agent tutorial published in 2025. I studied the Anthropic SDK docs. I bookmarked Simon Willison’s engineering patterns and Karpathy’s AutoResearch repo. I understood, conceptually, how agents worked: LLM reasons, calls a tool, observes the result, loops.
Then I sat down to build one and stared at an empty file for twenty minutes.
The tutorials showed the happy path. Call the API, get a tool response, return the answer. None of them showed what happens when the tool times out. Or when the model returns a malformed response. Or when the context window fills up mid-task. Or when the agent loops fifteen times and never decides to stop. The gap between “I understand agents” and “I can build a reliable one” was wider than any tutorial acknowledged.
This article closes that gap. We are going to build a research assistant agent: three tools, full error handling, context management, loop termination, and a basic eval. One Python file, a few hundred lines, runnable with nothing but an Anthropic API key.
By the end, you will have built something real. Not a weather lookup toy. A system that takes a research question, searches for sources, reads pages, takes notes, and synthesizes a cited brief.
What We Are Building
The research assistant takes a question from the user and produces a sourced research brief. Here is the architecture.
The agent has three tools:
- web_search: Takes a query, returns a list of search results with titles, URLs, and snippets.
- read_page: Takes a URL, returns the page content (text extracted from HTML).
- save_note: Takes a note with source attribution and saves it to the agent’s working memory.
The agent loop runs until one of two conditions is met: the agent has gathered enough sources and decides to synthesize, or the iteration limit (10) is reached. Every tool call is wrapped in error handling. Token usage is tracked after each iteration.
The Tools
Each tool needs a name, a description, and an input schema. The description matters more than you might expect. It is the only thing the LLM reads when deciding which tool to call and how to call it. A vague description produces vague tool usage.
View code: tool definitions
TOOLS = [
    {
        "name": "web_search",
        "description": (
            "Search the web for information on a topic. Returns a list of results, "
            "each with a title, URL, and snippet. Use this to find relevant sources "
            "for the research question. Prefer specific queries over broad ones."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query. Be specific and targeted.",
                }
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_page",
        "description": (
            "Read the content of a web page given its URL. Returns the extracted "
            "text content. Use this after web_search to read promising results in "
            "full. Do not read more than 3 pages per research question."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The full URL of the page to read.",
                }
            },
            "required": ["url"],
        },
    },
    {
        "name": "save_note",
        "description": (
            "Save a research note with source attribution. Use this to record key "
            "findings as you research. Each note should contain a single claim or "
            "finding, the source URL, and a brief explanation. Save notes as you go; "
            "do not wait until the end."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "claim": {
                    "type": "string",
                    "description": "The key finding or claim.",
                },
                "source_url": {
                    "type": "string",
                    "description": "The URL where this information was found.",
                },
                "explanation": {
                    "type": "string",
                    "description": "Brief context explaining the claim.",
                },
            },
            "required": ["claim", "source_url", "explanation"],
        },
    },
]
Three details worth noting. First, the web_search description tells the model to prefer specific queries. Without this guidance, models tend to search for the entire research question verbatim, which produces poor results. Second, the read_page description caps reading at three pages. This is a soft constraint the model respects most of the time, and it prevents the agent from burning through its iteration budget reading every search result. Third, save_note instructs the model to save notes incrementally. Without this, the model tends to hold everything in its reasoning and attempt a single massive synthesis at the end, which is fragile.
The Tool Implementations
Since web search and page reading require external APIs, we use mock implementations that return realistic data. The code is designed so you can swap in real implementations with minimal changes.
View code: tool implementations (mocks with swap instructions)
import json
import time

# Research notes accumulate here across iterations
research_notes: list[dict] = []


def web_search(query: str, timeout: int = 10) -> str:
    """Mock web search. Returns realistic results for any query.

    To use a real search API, replace this function body with:

        import httpx
        resp = httpx.get(
            "https://api.tavily.com/search",
            params={"query": query, "api_key": os.environ["TAVILY_API_KEY"]},
            timeout=timeout,
        )
        return resp.text
    """
    results = [
        {
            "title": f"Research findings on: {query}",
            "url": f"https://example.com/research/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"A comprehensive analysis of {query}. Key findings include "
                "measurable improvements in efficiency and documented trade-offs "
                "in implementation complexity."
            ),
        },
        {
            "title": f"Industry report: {query}",
            "url": f"https://example.com/report/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"2025 industry data on {query}. Survey of 500 practitioners "
                "reveals adoption rates of 34% in enterprise settings, with "
                "significant variation by sector."
            ),
        },
        {
            "title": f"Case study: {query} in practice",
            "url": f"https://example.com/case/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"Real-world implementation of {query} at a Fortune 500 company. "
                "Reduced manual effort by 60% over 18 months. Lessons learned "
                "and failure modes documented."
            ),
        },
    ]
    return json.dumps(results)


def read_page(url: str, timeout: int = 15) -> str:
    """Mock page reader. Returns realistic page content.

    To use a real page reader, replace this function body with:

        import httpx
        from bs4 import BeautifulSoup
        resp = httpx.get(url, timeout=timeout, follow_redirects=True)
        soup = BeautifulSoup(resp.text, "html.parser")
        return soup.get_text(separator="\\n", strip=True)[:5000]
    """
    return (
        f"Content from {url}\n\n"
        "Key findings from this source:\n"
        "1. Organizations that implemented structured approaches saw 40-60% "
        "improvement in target metrics over 12-18 months.\n"
        "2. The most common failure mode was insufficient stakeholder alignment, "
        "not technical complexity.\n"
        "3. Teams that measured outcomes from day one iterated 3x faster than "
        "teams that deferred measurement.\n"
        "4. Cost of implementation ranged from $50K-$500K depending on scope, "
        "with median ROI positive within 9 months.\n\n"
        "The study surveyed 200 organizations across financial services, "
        "healthcare, and technology sectors between 2023 and 2025."
    )


def save_note(claim: str, source_url: str, explanation: str) -> str:
    """Save a research note with source attribution."""
    note = {
        "claim": claim,
        "source_url": source_url,
        "explanation": explanation,
        "saved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    research_notes.append(note)
    return json.dumps({"status": "saved", "total_notes": len(research_notes)})
The mock functions return enough realistic data for the agent to reason over meaningfully. The comments show exactly what to replace for a production implementation: Tavily for search, httpx plus BeautifulSoup for page reading. The save_note function is already production-ready; it just appends to an in-memory list.
The Agent Loop
This is the core of the agent. The loop calls Claude, dispatches tool calls, handles errors, tracks token usage, and enforces the iteration limit.
View code: the complete agent (one runnable file)
#!/usr/bin/env python3
"""Research assistant agent. Requires: pip install anthropic"""
import anthropic
import json
import sys
import time

# --- Configuration ---
MODEL = "claude-sonnet-4-20250514"
MAX_TOKENS = 4096
MAX_ITERATIONS = 10
TOKEN_WARNING_THRESHOLD = 80_000  # warn when input tokens approach this
TOOL_TIMEOUT = 15  # seconds

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

# --- Tool definitions (TOOLS list from above) ---
TOOLS = [
    {
        "name": "web_search",
        "description": (
            "Search the web for information on a topic. Returns a list of results, "
            "each with a title, URL, and snippet. Use this to find relevant sources "
            "for the research question. Prefer specific queries over broad ones."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query. Be specific and targeted.",
                }
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_page",
        "description": (
            "Read the content of a web page given its URL. Returns the extracted "
            "text content. Use this after web_search to read promising results in "
            "full. Do not read more than 3 pages per research question."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The full URL of the page to read.",
                }
            },
            "required": ["url"],
        },
    },
    {
        "name": "save_note",
        "description": (
            "Save a research note with source attribution. Use this to record key "
            "findings as you research. Each note should contain a single claim or "
            "finding, the source URL, and a brief explanation. Save notes as you go; "
            "do not wait until the end."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "claim": {
                    "type": "string",
                    "description": "The key finding or claim.",
                },
                "source_url": {
                    "type": "string",
                    "description": "The URL where this information was found.",
                },
                "explanation": {
                    "type": "string",
                    "description": "Brief context explaining the claim.",
                },
            },
            "required": ["claim", "source_url", "explanation"],
        },
    },
]

# --- Tool implementations ---
research_notes: list[dict] = []


def web_search(query: str, timeout: int = TOOL_TIMEOUT) -> str:
    """Mock web search. Replace body with Tavily API call for production."""
    results = [
        {
            "title": f"Research findings on: {query}",
            "url": f"https://example.com/research/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"A comprehensive analysis of {query}. Key findings include "
                "measurable improvements in efficiency and documented trade-offs."
            ),
        },
        {
            "title": f"Industry report: {query}",
            "url": f"https://example.com/report/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"2025 industry data on {query}. Survey of 500 practitioners "
                "reveals adoption rates of 34% in enterprise settings."
            ),
        },
        {
            "title": f"Case study: {query} in practice",
            "url": f"https://example.com/case/{query.replace(' ', '-')[:30]}",
            "snippet": (
                f"Real-world implementation of {query} at a Fortune 500 company. "
                "Reduced manual effort by 60% over 18 months."
            ),
        },
    ]
    return json.dumps(results)


def read_page(url: str, timeout: int = TOOL_TIMEOUT) -> str:
    """Mock page reader. Replace body with httpx + BeautifulSoup for production."""
    return (
        f"Content from {url}\n\n"
        "Key findings:\n"
        "1. Structured approaches yielded 40-60% improvement over 12-18 months.\n"
        "2. Most common failure mode: insufficient stakeholder alignment.\n"
        "3. Teams measuring from day one iterated 3x faster.\n"
        "4. Implementation cost: $50K-$500K, median ROI positive within 9 months.\n\n"
        "Survey of 200 organizations across financial services, healthcare, "
        "and technology sectors (2023-2025)."
    )


def save_note(claim: str, source_url: str, explanation: str) -> str:
    """Save a research note with source attribution."""
    note = {
        "claim": claim,
        "source_url": source_url,
        "explanation": explanation,
        "saved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    research_notes.append(note)
    return json.dumps({"status": "saved", "total_notes": len(research_notes)})


# Map tool names to functions
TOOL_DISPATCH = {
    "web_search": lambda args: web_search(args["query"]),
    "read_page": lambda args: read_page(args["url"]),
    "save_note": lambda args: save_note(
        args["claim"], args["source_url"], args["explanation"]
    ),
}


def execute_tool(name: str, args: dict) -> str:
    """Execute a tool call with error handling and timeout tracking."""
    if name not in TOOL_DISPATCH:
        return json.dumps({"error": f"Unknown tool: {name}"})
    try:
        start = time.time()
        result = TOOL_DISPATCH[name](args)
        elapsed = time.time() - start
        if elapsed > TOOL_TIMEOUT:
            return json.dumps({
                "error": f"Tool {name} took {elapsed:.1f}s (limit: {TOOL_TIMEOUT}s)",
                "partial_result": result[:500] if result else None,
            })
        return result
    except Exception as e:
        return json.dumps({"error": f"Tool {name} failed: {str(e)}"})


def run_agent(question: str) -> str:
    """Run the research assistant agent on a question."""
    system_prompt = (
        "You are a research assistant. Your job is to answer the user's research "
        "question by searching the web, reading relevant pages, and taking notes "
        "with source attribution.\n\n"
        "Process:\n"
        "1. Search for relevant sources using web_search.\n"
        "2. Read the most promising pages using read_page (max 3 pages).\n"
        "3. Save key findings using save_note as you discover them.\n"
        "4. Once you have enough information (at least 3 notes from 2+ sources), "
        "synthesize a brief with citations.\n\n"
        "Rules:\n"
        "- Every claim in your final brief must cite a source URL.\n"
        "- If search results are insufficient, try a different query.\n"
        "- If a page fails to load, skip it and try the next result.\n"
        "- Be concise. The brief should be 200-400 words."
    )
    messages = [{"role": "user", "content": question}]
    total_input_tokens = 0
    total_output_tokens = 0

    for iteration in range(1, MAX_ITERATIONS + 1):
        print(f"\n--- Iteration {iteration}/{MAX_ITERATIONS} ---")

        # Call the LLM
        try:
            response = client.messages.create(
                model=MODEL,
                max_tokens=MAX_TOKENS,
                system=system_prompt,
                tools=TOOLS,
                messages=messages,
            )
        except anthropic.APIError as e:
            print(f"  API error: {e}")
            break

        # Track token usage
        total_input_tokens += response.usage.input_tokens
        total_output_tokens += response.usage.output_tokens
        print(f"  Tokens: {response.usage.input_tokens} in, "
              f"{response.usage.output_tokens} out "
              f"(cumulative: {total_input_tokens} in, {total_output_tokens} out)")

        # Warn if approaching context limit
        if total_input_tokens > TOKEN_WARNING_THRESHOLD:
            print(f"  WARNING: Input tokens ({total_input_tokens}) approaching "
                  f"limit ({TOKEN_WARNING_THRESHOLD}). Agent should wrap up.")

        # Check stop reason
        if response.stop_reason == "end_turn":
            # Agent is done; extract final text
            final_text = ""
            for block in response.content:
                if hasattr(block, "text"):
                    final_text += block.text
            print(f"\n  Agent finished after {iteration} iterations.")
            print(f"  Total tokens: {total_input_tokens} in, "
                  f"{total_output_tokens} out")
            print(f"  Notes saved: {len(research_notes)}")
            return final_text

        if response.stop_reason != "tool_use":
            # Unexpected stop reason
            print(f"  Unexpected stop_reason: {response.stop_reason}")
            final_text = ""
            for block in response.content:
                if hasattr(block, "text"):
                    final_text += block.text
            return final_text or "Agent stopped unexpectedly."

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"  Tool call: {block.name}({json.dumps(block.input)[:80]}...)")
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        # Append assistant message and tool results to conversation
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    # If we hit max iterations, ask for a final synthesis
    print(f"\n  Max iterations ({MAX_ITERATIONS}) reached. Requesting synthesis.")
    messages.append({
        "role": "user",
        "content": (
            "You have reached the maximum number of research iterations. "
            "Synthesize your findings into a brief now, using the notes you "
            "have saved. Cite sources for every claim."
        ),
    })
    try:
        response = client.messages.create(
            model=MODEL, max_tokens=MAX_TOKENS,
            system=system_prompt, tools=TOOLS, messages=messages,
        )
        for block in response.content:
            if hasattr(block, "text"):
                return block.text
    except anthropic.APIError as e:
        return f"Final synthesis failed: {e}"
    return "Agent completed without producing a final brief."


# --- Entry point ---
if __name__ == "__main__":
    question = (
        sys.argv[1] if len(sys.argv) > 1
        else "What are the most effective approaches to implementing "
             "data quality monitoring in enterprise data platforms?"
    )
    print(f"Research question: {question}\n")
    brief = run_agent(question)
    print("\n" + "=" * 60)
    print("RESEARCH BRIEF")
    print("=" * 60)
    print(brief)
That is the entire agent in one file. Let me walk through the pieces that matter.
The Loop Explained
The core is a for loop bounded by MAX_ITERATIONS. This is not a while True. I used a bounded loop deliberately. An unbounded loop with a break condition inside is the most common source of runaway agents. The bounded loop guarantees termination.
for iteration in range(1, MAX_ITERATIONS + 1):
    response = client.messages.create(...)
    if response.stop_reason == "end_turn":
        return final_text  # Agent decided it's done
    if response.stop_reason != "tool_use":
        return "Agent stopped unexpectedly."  # Defensive
    # Process tool calls and continue
Three things happen on every iteration. The LLM is called with the full conversation history. The stop reason is checked: end_turn means the agent wants to return a final answer, tool_use means it wants to call a tool. Anything else is treated as an error. This three-way check handles the happy path, the completion path, and the defensive path in six lines.
Error Handling: The Part Tutorials Skip
The execute_tool function wraps every tool call in a try/except. If a tool crashes, the agent receives an error message and keeps running. This matters because real tool calls fail regularly. APIs time out. Pages return 403. JSON parsing breaks on unexpected HTML.
def execute_tool(name: str, args: dict) -> str:
    if name not in TOOL_DISPATCH:
        return json.dumps({"error": f"Unknown tool: {name}"})
    try:
        result = TOOL_DISPATCH[name](args)
        return result
    except Exception as e:
        return json.dumps({"error": f"Tool {name} failed: {str(e)}"})
When a tool returns an error, the LLM receives that error as a tool result. Claude is good at reading error messages and adapting. If read_page fails on a URL, the agent typically tries the next search result. If web_search fails, it reformulates the query. This graceful degradation is not automatic; it comes from the system prompt instruction: “If a page fails to load, skip it and try the next result.”
The execute_tool function also checks for unknown tool names. This handles the case where the model hallucinates a tool that does not exist. Without this check, you get a KeyError that crashes the entire agent.
Beyond Try/Except: Structured Error Responses
The basic execute_tool function catches errors and returns a generic JSON string: {"error": "Tool read_page failed: ConnectionTimeout"}. That is enough to keep the agent running, but it is not enough for the agent to make a good decision about what to do next. A connection timeout and an invalid URL are both errors, but they require opposite responses: retry the first, abandon the second.
Production agents need structured error responses. Every tool error should carry three pieces of information.
from dataclasses import dataclass

@dataclass
class ToolError:
    message: str        # Human-readable description
    errorCategory: str  # transient | validation | business | permission | not_found
    isRetryable: bool   # Can the LLM retry this exact call?
    suggestion: str | None = None  # What the model should do next
The five error categories drive the agent’s recovery behavior:
| Category | When It Applies | Retryable? | Agent Should… |
|---|---|---|---|
| transient | Timeouts, rate limits, temporary outages | Yes | Retry after a brief pause |
| validation | Bad input format, missing required field | No | Ask the user to correct the input |
| business | Valid request, but a business rule prevents it | No | Explain the constraint to the user |
| permission | Caller lacks access to the resource | No | Explain the access limitation |
| not_found | The requested entity does not exist | No | Ask the user to verify the identifier |
Here is the upgraded execute_tool function:
def execute_tool(name: str, args: dict) -> tuple[str, bool]:
    """Execute a tool call. Returns (result_json, is_error)."""
    if name not in TOOL_DISPATCH:
        return json.dumps({
            "error": f"Unknown tool: {name}",
            "errorCategory": "validation",
            "isRetryable": False,
            "suggestion": f"Available tools: {', '.join(TOOL_DISPATCH.keys())}",
        }), True
    try:
        start = time.time()
        result = TOOL_DISPATCH[name](args)
        elapsed = time.time() - start
        if elapsed > TOOL_TIMEOUT:
            return json.dumps({
                "error": f"Tool {name} took {elapsed:.1f}s (limit: {TOOL_TIMEOUT}s)",
                "errorCategory": "transient",
                "isRetryable": True,
                "suggestion": "Retry with a simpler query or try a different source.",
            }), True
        return result, False
    except ConnectionError:
        return json.dumps({
            "error": f"Could not connect to {name} service.",
            "errorCategory": "transient",
            "isRetryable": True,
            "suggestion": "Retry after a brief pause.",
        }), True
    except ValueError as e:
        return json.dumps({
            "error": f"Invalid input: {str(e)}",
            "errorCategory": "validation",
            "isRetryable": False,
            "suggestion": "Check the input format and try with corrected values.",
        }), True
    except Exception as e:
        return json.dumps({
            "error": f"Tool {name} failed: {str(e)}",
            "errorCategory": "transient",
            "isRetryable": True,
        }), True
The is_error boolean returned alongside the result maps to the isError flag in tool results. When is_error is True, the model knows the result is an error, not data. Without this flag, the model treats error text as a successful result. It might say “Here are your results: connection timeout” and try to synthesize a brief from an error message.
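One way to wire this into the agent loop is a small helper that runs the tool and packages the outcome as a tool_result block, setting the result's error flag when the tuple says so. This is a sketch, not the article's loop verbatim; the helper name `build_tool_result` is illustrative, and it takes the executor as a parameter so it works with any tuple-returning `execute_tool`:

```python
import json


def build_tool_result(tool_use_id: str, name: str, args: dict, execute_tool) -> dict:
    """Run a tool and package the outcome as a tool_result block.

    Marks the block as an error when execute_tool reports one, so the
    model can distinguish failure text from successful data.
    """
    result, is_error = execute_tool(name, args)
    block = {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": result,
    }
    if is_error:
        block["is_error"] = True  # flags this result as a failure, not data
    return block
```

In the loop, each `tool_results.append({...})` call would be replaced by `tool_results.append(build_tool_result(block.id, block.name, block.input, execute_tool))`.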
“Checked and Found Nothing” vs. “Failed to Check”
The most important distinction in tool error handling is one that generic try/except misses entirely. Consider two scenarios for web_search("quantum computing applications in healthcare"):
- The search runs successfully but returns zero results.
- The search API is down and throws an exception.
Both produce “no results,” but they mean opposite things. Scenario 1 is a valid answer: there is nothing to find. Scenario 2 is a failure: we do not know whether there is something to find. The agent’s behavior should differ:
- Valid empty result (is_error=False): “My search found no results for that query. Let me try a different search term.”
- Access failure (is_error=True): “The search service is unavailable. I cannot complete this research right now.”
In the research agent, this distinction matters because the agent decides when it has “enough information” to synthesize. If it treats a failed search as “no results exist,” it might synthesize a brief claiming there is limited research on a topic when in fact it simply could not reach the search API.
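A minimal sketch of the distinction, using a hypothetical `safe_search` wrapper (the wrapper name is illustrative, not part of the article's agent code): an empty list comes back as a valid result, while an exception comes back as a flagged failure.

```python
import json


def safe_search(search_fn, query: str) -> tuple[str, bool]:
    """Run a search, separating 'found nothing' from 'could not check'.

    Returns (result_json, is_error). An empty result list is a valid
    answer (is_error=False); an exception is an access failure
    (is_error=True) that the agent should not mistake for absence.
    """
    try:
        results = search_fn(query)
    except Exception as e:
        # Failed to check: we do not know whether anything exists.
        return json.dumps({
            "error": f"Search service unavailable: {e}",
            "errorCategory": "transient",
            "isRetryable": True,
        }), True
    if not results:
        # Checked and found nothing: a valid, complete answer.
        return json.dumps({"results": [], "note": "no matches found"}), False
    return json.dumps({"results": results}), False
```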
Validation-Retry: Not Blind Retry
When an error is retryable, the agent should not resubmit the identical request. That is blind retry, and it fails for the same reason the original call failed. Effective retry means modifying the request based on the specific error.
The pattern has three components: the original request, the failed result, and the specific error that caused the failure. All three go into the next attempt.
# Bad: blind retry (same input, same failure)
result = web_search("quantum computing healthcare") # times out
result = web_search("quantum computing healthcare") # times out again
# Good: retry with error feedback (simplified query)
result = web_search("quantum computing healthcare") # times out
# Agent reads the error: transient, isRetryable, suggestion says "simpler query"
result = web_search("quantum computing medical") # narrower query, succeeds
In the research agent, this behavior emerges from the system prompt combined with structured error responses. When the agent receives a transient error with a suggestion, it adapts its next tool call. When it receives a validation error, it asks the user for corrected input instead of retrying. The error structure replaces guesswork with a decision framework.
The basic error handling from the previous section keeps the agent alive. Structured error responses make it intelligent about recovery.
Context Management
Every iteration tracks token usage via response.usage.input_tokens and response.usage.output_tokens. When input tokens exceed the warning threshold (80,000), the agent prints a warning. This does not force the agent to stop, but it signals that the conversation history is growing large.
In a production agent, you would go further. Anthropic’s token counting API lets you check token counts before sending a request. You could add a hard cutoff that triggers synthesis when the context reaches 90% of the model’s limit. You could also implement context compression: summarizing older tool results instead of keeping the full text.
For a weekend build, the warning threshold plus the iteration limit is sufficient. The iteration limit is your primary safety net. Ten iterations with three tool calls each is thirty tool calls, which is more than enough for a research task.
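One lightweight form of the context compression mentioned above is truncating tool results from older turns in place, since their key findings are usually already captured in saved notes. A sketch, assuming the messages-list shape the agent uses; the helper name and thresholds are illustrative:

```python
def compress_old_tool_results(messages: list[dict], keep_last: int = 2,
                              max_chars: int = 200) -> list[dict]:
    """Truncate tool_result content in all but the last `keep_last` tool turns.

    Older tool output has usually been distilled into saved notes, so
    shortening it frees context without losing the agent's findings.
    """
    # Indices of user messages that carry tool results (content is a list)
    tool_turns = [
        i for i, m in enumerate(messages)
        if m["role"] == "user" and isinstance(m["content"], list)
    ]
    old_turns = tool_turns[:-keep_last] if keep_last else tool_turns
    for i in old_turns:
        for block in messages[i]["content"]:
            if block.get("type") == "tool_result" and isinstance(block.get("content"), str):
                if len(block["content"]) > max_chars:
                    block["content"] = block["content"][:max_chars] + " ...[truncated]"
    return messages
```

Calling this before each `client.messages.create` call would cap how fast the conversation history grows, at the cost of losing detail the model might still want to revisit.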
Running It
Save the full code block above as research_agent.py and run:
pip install anthropic
export ANTHROPIC_API_KEY="your-key-here"
python research_agent.py "What are the best practices for AI governance in financial services?"
Here is representative output from a run:
Research question: What are the best practices for AI governance in financial services?
--- Iteration 1/10 ---
Tokens: 1247 in, 156 out (cumulative: 1247 in, 156 out)
Tool call: web_search({"query": "AI governance best practices financial services 2025"}...)
--- Iteration 2/10 ---
Tokens: 2103 in, 203 out (cumulative: 3350 in, 359 out)
Tool call: read_page({"url": "https://example.com/research/AI-governance-best-pract"}...)
--- Iteration 3/10 ---
Tokens: 3012 in, 189 out (cumulative: 6362 in, 548 out)
Tool call: save_note({"claim": "Structured AI governance approaches yielded 40-60%"}...)
Tool call: save_note({"claim": "Most common failure mode was insufficient stakeholde"}...)
--- Iteration 4/10 ---
Tokens: 3845 in, 167 out (cumulative: 10207 in, 715 out)
Tool call: read_page({"url": "https://example.com/report/AI-governance-best-pract"}...)
--- Iteration 5/10 ---
Tokens: 4521 in, 201 out (cumulative: 14728 in, 916 out)
Tool call: save_note({"claim": "34% adoption rate in enterprise settings for formal"}...)
--- Iteration 6/10 ---
Tokens: 5102 in, 312 out (cumulative: 19830 in, 1228 out)
Agent finished after 6 iterations.
Total tokens: 19830 in, 1228 out
Notes saved: 3
============================================================
RESEARCH BRIEF
============================================================
Based on research across industry reports and case studies, here are
the key findings on AI governance in financial services...
[Sources cited inline with URLs]
The agent searched once, read two pages, saved three notes, and synthesized a brief. Six iterations, under 20K input tokens. It decided on its own when it had enough information. That decision came from the system prompt instruction: “Once you have enough information (at least 3 notes from 2+ sources), synthesize a brief.”
What Can Go Wrong
Every failure mode in this agent has a corresponding defense. Here is the mapping.
| Failure Mode | What Happens | How the Code Handles It |
|---|---|---|
| Tool crash | read_page raises an exception (timeout, bad URL, DNS failure) | execute_tool catches the exception and returns a structured error with category and retry guidance. The LLM reads the category and adapts: retries transient failures, abandons validation errors. |
| Unknown tool | The model hallucinates a tool name like analyze_data | execute_tool checks TOOL_DISPATCH and returns a validation error with a list of available tools. The LLM picks a valid tool on the next iteration. |
| Empty result vs. failure | web_search returns zero results, or the search API is unreachable | Structured errors distinguish is_error=False (valid empty result: nothing found) from is_error=True (service failure: could not check). The agent adjusts its synthesis accordingly. |
| Runaway loop | The agent keeps searching without synthesizing | MAX_ITERATIONS = 10 caps the loop. If reached, the agent is forced to synthesize with whatever notes it has. |
| Context overflow | Conversation grows too long for the model | TOKEN_WARNING_THRESHOLD prints a warning. In production, add a hard cutoff that triggers immediate synthesis. |
| Malformed response | stop_reason is neither end_turn nor tool_use | The defensive branch returns whatever text the model produced, or a fallback message. |
| API failure | Anthropic API returns a 500, rate limit, or network error | anthropic.APIError catch breaks the loop and returns what the agent has so far. |
| Empty synthesis | The model finishes but returns no text | The final fallback: “Agent completed without producing a final brief.” |
The defense that matters most is the bounded loop. A while True loop with the wrong break condition can run indefinitely, burning tokens and money. The for loop with MAX_ITERATIONS makes termination a structural guarantee, not a behavioral hope.
Adding a Basic Eval
Building the agent is half the work. The other half is knowing whether the output is any good. Hamel Husain’s eval framework says it clearly: “Evals tell you whether the output is good.” Without them, you are judging by feel.
Here is a basic eval that checks three properties of the research brief.
View code: basic eval for the research agent
import re


def eval_research_brief(brief: str, notes: list[dict]) -> dict:
    """Score a research brief on three dimensions.

    Returns a dict with scores and a pass/fail verdict.
    """
    scores = {}

    # 1. Source count: how many unique URLs are cited in the brief?
    urls_in_brief = set(re.findall(r"https?://[^\s\)\"]+", brief))
    scores["sources_cited"] = len(urls_in_brief)
    scores["sources_pass"] = len(urls_in_brief) >= 2

    # 2. Note coverage: what fraction of saved notes appear in the brief?
    notes_referenced = 0
    for note in notes:
        # Check if the claim (or a substantial substring) appears in the brief
        claim_words = note["claim"].lower().split()
        # Match if at least 60% of claim words appear in the brief
        brief_lower = brief.lower()
        matches = sum(1 for w in claim_words if w in brief_lower)
        if matches / max(len(claim_words), 1) >= 0.6:
            notes_referenced += 1
    scores["notes_referenced"] = notes_referenced
    scores["notes_total"] = len(notes)
    scores["coverage_pass"] = (
        notes_referenced / max(len(notes), 1) >= 0.5
    )

    # 3. Length check: is the brief within the target range?
    word_count = len(brief.split())
    scores["word_count"] = word_count
    scores["length_pass"] = 100 <= word_count <= 600

    # Overall verdict
    scores["passed"] = all([
        scores["sources_pass"],
        scores["coverage_pass"],
        scores["length_pass"],
    ])
    return scores


# Run the eval after the agent completes
if __name__ == "__main__":
    # ... (after run_agent returns)
    eval_result = eval_research_brief(brief, research_notes)
    print("\n" + "=" * 60)
    print("EVAL RESULTS")
    print("=" * 60)
    for key, value in eval_result.items():
        print(f"  {key}: {value}")
    print(f"\n  Verdict: {'PASS' if eval_result['passed'] else 'FAIL'}")
The eval checks three things. First, does the brief cite at least two distinct source URLs? A brief without citations is not research; it is the model’s training data repackaged. Second, does the brief reference at least half of the notes the agent saved? If the agent saved four notes but the brief only uses one, information was lost in synthesis. Third, is the brief within the target word count? Too short means insufficient detail. Too long means the model is padding.
This eval is simple. It does not check factual accuracy, logical coherence, or writing quality. Article 6 in this series covers how to build serious evals with LLM-as-a-Judge scoring and regression test suites. But this basic eval catches the most common failure modes: missing citations, dropped findings, and incorrect length. You can run it on every agent invocation in under a second. Start here.
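To see how the citation and coverage heuristics behave before wiring them into an agent run, here is the core of each check in isolation, on a toy brief and a toy note (the example text and URLs are made up for illustration):

```python
import re

brief = (
    "Transformer models scale well with data. "
    "See https://example.com/a and https://example.com/b for details."
)
notes = [{"claim": "Transformer models scale well with data"}]

# Unique cited URLs, using the same regex as the eval
urls = set(re.findall(r"https?://[^\s\)\"]+", brief))

# 60% word-overlap coverage check, same heuristic as the eval
brief_lower = brief.lower()
referenced = 0
for note in notes:
    words = note["claim"].lower().split()
    matches = sum(1 for w in words if w in brief_lower)
    if matches / max(len(words), 1) >= 0.6:
        referenced += 1

print(len(urls), referenced)  # → 2 1
```

Note that the overlap check matches words anywhere in the brief, not as a contiguous phrase, so it tolerates paraphrase at the cost of occasional false positives.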
Connecting to the Series
This agent is a direct application of the principles established in the first three articles.
From Article 1 (What Is an AI Agent): The agent has all four components. The LLM (Claude) is the reasoning engine. The tools (web_search, read_page, save_note) are the hands. Memory accumulates in the messages list and the research_notes array. The loop iterates until the goal is met or the iteration limit is reached.
From Article 3 (Pike’s Rules): We applied Rule 3 directly: start simple. One agent, one system prompt, three tools. No orchestration framework. No multi-agent routing. No vector database. The code uses the raw Anthropic SDK because Anthropic’s own guidance recommends starting without a framework. If this agent proves insufficient for your use case, you will know exactly which part needs more sophistication, because you understand every line.
Looking ahead: Article 5 explains why the data entering the agent’s context window matters more than the model. Notice how the system prompt structures what the agent pays attention to and how it processes tool results. That is Context Engineering in practice. Article 6 shows how to build evaluation pipelines that go beyond the basic eval we wrote here. Article 7 adds the safety layer: what happens when tool results contain prompt injection, stale data, or contradictions.
The intent was not to build the best possible research agent. It was to build one that demonstrates every concept the series teaches, in working code, so the remaining articles have a concrete artifact to reference.
Do Next
| Priority | Action | Why it matters |
|---|---|---|
| No experience | Copy the full agent code, install the anthropic package, set your API key, and run it with the default question. Read the iteration output. Watch the agent decide which tool to call and when to stop. | Reading agent code is different from watching it run. The iteration log shows you the decision loop in real time: search, read, note, synthesize. That loop is the concept from Article 1 made concrete. |
| No experience | Change the research question to something you actually want to know about. Run it again. Compare the two briefs. | A research agent on a topic you care about surfaces the quality questions immediately: did it find good sources? Did it miss something obvious? Did the brief make sense? These reactions are the start of building evaluation instincts. |
| Learning | Replace the mock web_search with a real search API (Tavily has a free tier). Run the agent on three different questions and compare the output quality. | Mock data produces predictable results. Real data produces surprising failures: irrelevant search results, pages that cannot be parsed, sources that contradict each other. These failures teach you more about agent reliability than any tutorial. |
| Learning | Add a fourth tool: check_claim that takes a claim and a source URL and verifies whether the source actually supports the claim. Wire it into the system prompt. | This is reasoning-layer validation from Article 7, applied inside the agent itself. You are building a self-checking agent, which previews the self-improvement patterns in Article 8. |
| Practitioner | Adapt the agent loop and error handling patterns to your domain. Replace the three research tools with tools relevant to your work: a Data Catalog lookup, a metric query, a schema validator. Keep the loop structure, the error handling, and the eval. | The architecture is domain-agnostic. The tools are domain-specific. The loop, error handling, and eval patterns transfer directly to any agent you build. Swap the tools, keep the skeleton. |
| Practitioner | Run the basic eval on 20 different research questions. Track pass rates across the three dimensions (sources, coverage, length). Identify which dimension fails most often. | Twenty runs with tracked metrics is a minimal eval suite. You will discover patterns: the agent consistently cites only one source, or it drops notes during synthesis, or it overshoots the word count. Each pattern points to a specific prompt or tool description improvement. |
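For the check_claim exercise in the table above, a reasonable starting point is the tool definition plus a placeholder implementation. The schema below follows the Anthropic tool-use format; the name, description, and the crude lexical check are illustrative assumptions, not a verification algorithm:

```python
# Hypothetical tool schema in the Anthropic tool-use format.
# The description is what steers the agent to call it before saving a note.
CHECK_CLAIM_TOOL = {
    "name": "check_claim",
    "description": (
        "Verify whether a source supports a claim. Call this before "
        "saving a note whose claim you are unsure about."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "claim": {"type": "string", "description": "The claim to verify."},
            "source_url": {"type": "string", "description": "URL of the source."},
        },
        "required": ["claim", "source_url"],
    },
}

def check_claim(claim: str, page_text: str) -> dict:
    # Placeholder: does most of the claim's vocabulary appear in the
    # fetched page text? A real version would use a second model call.
    words = claim.lower().split()
    hits = sum(1 for w in words if w in page_text.lower())
    return {"supported": hits / max(len(words), 1) >= 0.6}

result = check_claim("the sky is blue", "Observations confirm the sky is blue.")
print(result)  # → {'supported': True}
```

In the real tool handler you would fetch source_url with read_page first, then pass the page text to the check; the lexical heuristic is only there so the loop has something to run before you swap in a model-based judge.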
This is Part 4 of 12 in The Practitioner’s Guide to AI Agents. ← Previous: Pike’s Five Rules for Agent Development · Next: Context Is the Program →
Sources & References
- Anthropic: Tool Use with Claude (2025)
- Anthropic: Building Effective Agents (2024)
- Anthropic Python SDK (2025)
- Tavily Search API (2025)
- Simon Willison: Agentic Engineering Patterns (2026)
- Hamel Husain: Evals Skills for Coding Agents (2026)
- Karpathy AutoResearch (2026)
- Anthropic: Message Batches and Token Counting (2025)