AI Products & Strategy · March 25, 2026 · 17 min read

From Problem to Agent: An Implementation Reference Guide

The preceding articles in this series each taught one concept. This capstone walks through all of them applied to one problem: building a Data Quality monitoring agent. Seven steps, from problem definition through production deployment, showing the decision-making process that separates agent projects that ship from agent projects that stall.

By Vikas Pratap Singh
#ai-agents #data-quality #implementation-guide #context-engineering #agent-architecture #agentic-engineering

Part 12 of 12: The Practitioner’s Guide to AI Agents

A Midnight Page and Three Months of Bad Data

Early in my career, I was on the run team for a U.S. healthcare client when a page came in at midnight. The data in one of the fact tables was breaking downstream dashboards and reports. Business users could see the numbers were off.

My first instinct was to run the reconciliation queries my team and I used for exactly this kind of call. Nothing came back as incorrect. Row counts matched. Schema was fine. The standard checks all passed.

Deeper investigation across multiple time snapshots told a different story. The data had been coming in wrong for over three months. A mapping change deployed three months earlier had introduced an error that unit testing did not catch. The column mapped to a slightly different source field, producing values that were structurally valid but semantically wrong. Daily reconciliation never flagged it because the format and volume were unchanged.

The only reason it surfaced after three months was a quarterly report. That report used the affected column for a specific analysis, and the quarterly numbers were visibly off. Business users caught what our monitors missed.

That incident cost weeks of investigation, evidence production, and rework. It also crystallized a question I have been thinking about ever since: what would it take to build a system that catches these failures before a quarterly report does? Not a threshold alert. Not a reconciliation query that checks format and volume. A system that reasons about whether data looks right, compares it against what it should look like, and tells you in plain language what is wrong and how urgent it is.

That system is an agent. And building it is the problem this article walks through, step by step, using every concept from the articles that came before.

The Seven Steps

This series taught concepts one at a time. Agent definition. Design principles. Building. Prompt specification. Context quality. Evals. Guardrails. Observability. Self-improvement. Each article stands alone, but they compose into a decision-making process. This article traces that process from start to finish on a single problem.

The seven-step decision flow, from problem definition through production deployment, with each step mapped to its corresponding series article.

The problem: build a Data Quality monitoring agent that connects to data sources, checks freshness and schema conformance, monitors value distributions for semantic drift, compares all results against expected baselines, detects anomalies, and produces a daily brief with findings and severity levels. The agent needs to know when to escalate and when to auto-resolve. The distribution monitoring is the capability that separates this agent from a script: it catches the class of failure that hid in that fact table for three months.

Here is how you get from that problem statement to a production system.

Step 1: Define the Problem (Article 1)

Article 1 established the four components every agent shares: LLM, tools, memory, and loop. The first step in any agent project is mapping your problem onto those four components.

LLM (reasoning engine): The model interprets anomalies. A 15% row count drop could be a data loss event or a normal weekend pattern. A schema change could be a planned migration or an upstream break. A shift in value distribution could be a mapping error or a legitimate change in the source data. The LLM reads the check results, compares them against baseline expectations, correlates signals across multiple checks, and writes a natural language interpretation with severity. This is the component that would have caught the fact table issue: it reasons about whether data makes sense, not just whether it exists.

Tools (data source connectors): The agent needs functions to query data sources (a warehouse health check, an API ping, a file system scan), read a configuration file with expected baselines, and send alerts through Slack, email, or PagerDuty.

Memory (baseline expectations): The agent stores expected row counts, schema signatures, freshness thresholds, value distribution baselines, and historical check results. This memory lets it distinguish “different from yesterday” (possibly fine) from “different from every day this quarter” (probably broken). The distribution baselines are especially important: without them, the agent has no reference point for “what does normal look like for this column?”

Loop (iterate across sources): The agent checks one data source at a time, accumulates findings, and decides after each check whether to continue, escalate immediately, or mark a source as healthy. The loop terminates when all sources are checked or when a critical issue triggers early escalation.
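
In code, that loop is only a few lines. Here is a minimal sketch, with run_checks, interpret, and escalate as hypothetical stand-ins for the real tool connectors and LLM call:

```python
def daily_run(sources, config, run_checks, interpret, escalate):
    """Check sources one at a time; accumulate findings; escalate and
    stop early when a critical issue is found."""
    findings = []
    for source in sources:
        raw = run_checks(source, config[source])           # tool calls
        finding = interpret(source, raw, config[source])   # LLM reasoning step
        findings.append(finding)
        if finding["severity"] == "critical":
            escalate(finding)                              # early escalation
            break                                          # terminate the loop
    return findings
```

The helpers are injected as parameters so the loop itself stays testable without a live warehouse or model.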

Article 1’s spectrum table answers a positioning question: where does this agent sit? It is an Agent, not an Autonomous Agent. It runs daily on a schedule with human oversight of alerts. It does not run continuously for days. It does not set its own sub-goals. A human reviews the daily brief and decides what to act on. That positioning determines the guardrail strategy later.

Step 2: Should This Be an Agent? (Article 2)

Article 2 provides the decision framework. Before writing code, answer: does this problem actually need an agent, or would a simpler solution work?

The case for a script: A cron job that runs SQL queries, checks row counts against thresholds, and fires alerts on violations. Simple, reliable, well-understood. Most Data Quality monitoring in production today works this way. Tools like Great Expectations and Soda do exactly this.

Where a script falls short: Go back to the fact table incident from the opening. Row counts were correct. Schema was unchanged. Freshness was fine. The failure was semantic: a mapping change caused a column to pull from a slightly different source field, producing values that were structurally valid but meaningfully wrong. A threshold-based monitor checks format and volume. It does not ask “do these values make sense compared to what this column has historically contained?” That question requires reasoning, not a WHERE clause.

The decision criteria, applied to this specific problem:

  • Does the task require reasoning over ambiguous signals? Yes. The fact table values were within plausible ranges individually. The wrongness only became visible when you compared distributions over time or cross-referenced values against what the source system was actually sending. A script checks whether a value exists and falls within a threshold. An agent reasons about whether a value is consistent with what it should be.
  • Does the agent need to adapt its approach based on what it finds? Yes. If the agent detects a distribution shift in one column of the fact table, it should check whether related columns show a correlated shift (suggesting a mapping change) or whether the shift is isolated (suggesting a source data change). A script runs the same checks regardless of what earlier checks found.
  • Does the output benefit from natural language summarization? Yes. “claims_fact.procedure_code distribution shifted 23% from historical baseline starting March 3, consistent with a mapping change rather than a source data change. Three downstream reports use this column.” That tells a data engineer what happened, when it started, and what is affected. A Slack alert reading “WARN: procedure_code distribution drift” does not.

The verdict: this should be an agent. A script handles the structural checks (row counts, schema, freshness). Those are the 80% case. The agent handles the semantic checks: does the data make sense, and if not, what changed and how urgent is it? That 20% is where the fact table incident lived for three months undetected.

Step 3: Apply Design Principles (Article 3)

Article 3 mapped Pike’s five rules to agent development. Two rules dominate this step.

Rule 3 (start simple): Do not build a multi-source, multi-dimension monitoring agent on day one. Start with one data source, one quality dimension, and one output format (a text summary). Get that working and measured before expanding.

Which quality dimension first? Not freshness. Freshness is the easiest to check, but it is not why you are building an agent. A cron job checks freshness just fine. Start with value distribution, the semantic check that would have caught the fact table incident. That is the dimension that justifies the agent’s existence. Once distribution monitoring works on one table, add freshness and row counts. Those are the easy wins that round out coverage.

The temptation is to build the complete system: ten data sources, six quality dimensions, automated remediation, Slack integration, PagerDuty escalation. Resist it. Anthropic’s agent-building guide recommends the same: start with the simplest architecture that could possibly work. Add complexity only when measurement proves it necessary.

Rule 5 (data dominates): The data source metadata (freshness expectations, schema contracts, historical baselines) IS the context. The quality of that metadata determines the quality of the agent’s reasoning. If your baseline config says a table updates hourly but it actually updates daily, every freshness check will fire a false alarm.

Get the config right first. Interview the data owners. Check the actual update patterns against what the documentation says. The time you spend validating your baseline metadata will save you ten times that in false positive investigation later.

Step 4: Engineer the Context (Article 6)

Article 6 introduced the five criteria from the Context Engineering paper and the context placement tactics for positioning critical data within the window. Here is how each one applies to the Data Quality agent.

Relevance: Only load metadata for the sources being checked today. If the agent checks three of your twenty tables, do not dump all twenty table definitions into the context window. Irrelevant context degrades performance on the relevant context.

Sufficiency: Include historical baselines so the agent can compare. A single day’s row count means nothing in isolation. The agent needs the last 30 days of summary statistics (mean, standard deviation, day-of-week pattern) to determine whether today’s number is anomalous. For value distribution checks, it needs the baseline frequency distribution for categorical columns and the historical mean/percentiles for numeric columns. Without that baseline, the agent cannot distinguish “the procedure_code distribution looks different today” from “the procedure_code distribution has always looked like this.”

Isolation: Do not mix alert history with baseline data. The agent’s reasoning about whether today’s check result is anomalous should not be contaminated by whether yesterday’s alert was acknowledged or dismissed. Those are different decisions requiring different context.

Economy: Compress historical data to summary statistics. The agent does not need 30 individual daily row counts. It needs: mean = 142,000, stddev = 8,200, weekend_mean = 98,000, weekend_stddev = 5,100. This reduces token cost by an order of magnitude while preserving the information the agent needs for its comparison.

Provenance: Tag every check result with the source system, the timestamp of the check, and the timestamp of the data itself. When the agent reports “orders table is 36 hours stale,” a human reviewing the brief needs to know: when was the check run, and what was the last update timestamp? Without provenance, the brief is not auditable.
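
The economy and provenance criteria are mechanical enough to sketch directly. The helper names below are illustrative, not from the series:

```python
import statistics
from datetime import datetime, timezone

def summarize_history(daily_counts, weekend_flags):
    """Economy: compress a window of daily row counts into the summary
    statistics the agent actually needs for its comparison."""
    weekday = [c for c, w in zip(daily_counts, weekend_flags) if not w]
    weekend = [c for c, w in zip(daily_counts, weekend_flags) if w]
    return {
        "mean": round(statistics.mean(weekday)),
        "stddev": round(statistics.stdev(weekday)),
        "weekend_mean": round(statistics.mean(weekend)),
        "weekend_stddev": round(statistics.stdev(weekend)),
    }

def tag_check_result(source, result, data_timestamp):
    """Provenance: stamp each check result with when the check ran and
    the timestamp of the data itself, so the brief is auditable."""
    return {
        "source": source,
        "check_run_at": datetime.now(timezone.utc).isoformat(),
        "data_timestamp": data_timestamp,
        **result,
    }
```
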

Here is what the context structure looks like:

Sample context config (YAML)
sources:
  - name: "orders"
    type: "warehouse_table"
    connection: "snowflake://analytics/public/orders"
    checks:
      freshness:
        expected_update_frequency: "hourly"
        max_stale_hours: 4
      row_count:
        baseline_mean: 142000
        baseline_stddev: 8200
        weekend_mean: 98000
        weekend_stddev: 5100
        alert_threshold_stddev: 3
      schema:
        expected_columns:
          - name: "order_id"
            type: "VARCHAR"
          - name: "created_at"
            type: "TIMESTAMP_NTZ"
          - name: "total_amount"
            type: "NUMBER(12,2)"
    severity_rules:
      freshness_critical_hours: 8
      row_count_critical_stddev: 5
      schema_change: "always_critical"

  - name: "claims_fact"
    type: "warehouse_table"
    connection: "snowflake://analytics/healthcare/claims_fact"
    checks:
      freshness:
        expected_update_frequency: "daily"
        max_stale_hours: 36
      row_count:
        baseline_mean: 2400000
        baseline_stddev: 120000
        alert_threshold_stddev: 2
      value_distribution:
        monitored_columns:
          - name: "procedure_code"
            type: "categorical"
            baseline_top_10_pct: [0.18, 0.12, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.04, 0.03]
            drift_threshold_pct: 15
          - name: "paid_amount"
            type: "numeric"
            baseline_mean: 342.50
            baseline_p95: 1850.00
            drift_threshold_pct: 20
    severity_rules:
      freshness_critical_hours: 48
      row_count_critical_stddev: 4
      distribution_drift: "always_critical"

escalation:
  critical: "pagerduty"
  warning: "slack:#data-quality"
  info: "daily_brief_only"

This config is the context. Article 5 covers the prompt specification patterns that make the agent’s tool definitions produce consistent, structured output: explicit criteria for severity levels, nullable fields for optional check results, and enum-with-fallback for anomaly categories that do not fit the standard taxonomy. Apply those patterns to the tool schemas for each check type.
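
As a sketch of how those patterns might look in practice, here is an illustrative JSON Schema for a check result, with a nullable severity and an enum-with-fallback anomaly category. The field names are assumptions, not taken from the series:

```python
# Illustrative check-result schema applying the Article 5 patterns:
# explicit severity enum, nullable fields, enum-with-fallback category.
CHECK_RESULT_SCHEMA = {
    "type": "object",
    "properties": {
        "detected": {"type": "boolean"},
        "severity": {
            # Nullable: null when no anomaly was detected.
            "type": ["string", "null"],
            "enum": ["critical", "warning", "info", None],
        },
        "anomaly_category": {
            # Enum-with-fallback: "other" catches anomalies outside
            # the standard taxonomy instead of forcing a wrong label.
            "type": ["string", "null"],
            "enum": ["freshness", "row_count", "schema",
                     "distribution_drift", "other", None],
        },
        "explanation": {"type": "string"},
    },
    "required": ["detected", "severity", "anomaly_category", "explanation"],
}
```
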

This config is also the case facts block from Article 6’s context placement tactics. Pin it at the top of the context, never summarize it, and mark it immutable. Every decision the agent makes traces back to values in this file. When the agent says “claims_fact.procedure_code distribution shifted 23% from baseline, severity: critical,” you can verify that against drift_threshold_pct: 15 and distribution_drift: "always_critical". When it says “claims_fact freshness is 40 hours, severity: warning,” you can check max_stale_hours: 36 and freshness_critical_hours: 48. The config makes the reasoning auditable.

Notice that the value_distribution check is what would have caught the fact table incident. The mapping change shifted procedure_code frequencies away from the historical baseline. Row counts and schema would have passed. Freshness would have passed. But a distribution comparison over the trailing 30 days would have flagged the shift within the first daily run after the mapping change deployed.
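
The comparison itself does not require the LLM; only the interpretation does. A sketch, using the largest per-category relative change as the drift measure (the series does not fix a specific metric, so treat that choice as an assumption):

```python
def distribution_drift_pct(baseline, current):
    """Largest relative change across categories, as a percentage.
    One reasonable drift measure; swap in your own if you prefer."""
    return max(abs(c - b) / b * 100 for b, c in zip(baseline, current))

# Baseline top-10 frequencies for procedure_code (from the config above)
baseline = [0.18, 0.12, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.04, 0.03]
# Frequencies after a simulated mapping change
shifted = [0.31, 0.14, 0.09, 0.06, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02]

drift = distribution_drift_pct(baseline, shifted)
flagged = drift > 15  # drift_threshold_pct from the config
```

The numeric result feeds the LLM's interpretation step; the agent's job is to explain whether the flagged shift looks like a mapping change or a source data change.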

Step 5: Build Evals (Article 7)

Article 7 established the eval pyramid and the validation patterns that catch problems during execution, not just after. For the Data Quality agent, “correct” means three things: the agent detects real anomalies, it does not flag normal variation as anomalous, and it assigns the right severity.

Build a test suite by injecting known anomalies into a test dataset and verifying the agent catches them.

Eval test suite for the Data Quality agent
"""Eval suite: inject known anomalies and verify detection."""

TEST_CASES = [
    {
        "name": "stale_data_critical",
        "source": "orders",
        "injected_anomaly": {"last_updated": "40 hours ago"},
        "expected_detection": True,
        "expected_severity": "critical",
    },
    {
        "name": "stale_data_warning",
        "source": "orders",
        "injected_anomaly": {"last_updated": "5 hours ago"},
        "expected_detection": True,
        "expected_severity": "warning",
    },
    {
        "name": "normal_freshness",
        "source": "orders",
        "injected_anomaly": {"last_updated": "45 minutes ago"},
        "expected_detection": False,
        "expected_severity": None,
    },
    {
        "name": "row_count_drop_weekday",
        "source": "orders",
        "injected_anomaly": {"row_count": 85000, "day": "Tuesday"},
        "expected_detection": True,
        "expected_severity": "critical",
        "notes": "85k is ~7 stddev below weekday mean of 142k",
    },
    {
        "name": "row_count_drop_weekend",
        "source": "orders",
        "injected_anomaly": {"row_count": 85000, "day": "Saturday"},
        "expected_detection": True,
        "expected_severity": "warning",
        "notes": "85k is ~2.5 stddev below weekend mean of 98k",
    },
    {
        "name": "schema_change_column_added",
        "source": "orders",
        "injected_anomaly": {"schema_diff": "column 'discount_code' added"},
        "expected_detection": True,
        "expected_severity": "critical",
    },
    {
        "name": "schema_unchanged",
        "source": "orders",
        "injected_anomaly": {"schema_diff": None},
        "expected_detection": False,
        "expected_severity": None,
    },
    {
        "name": "distribution_drift_mapping_change",
        "source": "claims_fact",
        "injected_anomaly": {
            "column": "procedure_code",
            "top_10_pct": [0.31, 0.14, 0.09, 0.06, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02],
        },
        "expected_detection": True,
        "expected_severity": "critical",
        "notes": "Top category jumped from 18% to 31%, simulating a mapping change",
    },
    {
        "name": "distribution_normal_variation",
        "source": "claims_fact",
        "injected_anomaly": {
            "column": "procedure_code",
            "top_10_pct": [0.19, 0.11, 0.10, 0.08, 0.07, 0.06, 0.05, 0.04, 0.04, 0.03],
        },
        "expected_detection": False,
        "expected_severity": None,
        "notes": "Minor variation within 15% drift threshold",
    },
    {
        "name": "numeric_distribution_shift",
        "source": "claims_fact",
        "injected_anomaly": {
            "column": "paid_amount",
            "mean": 428.00,
            "p95": 2200.00,
        },
        "expected_detection": True,
        "expected_severity": "critical",
        "notes": "Mean shifted 25% from baseline 342.50, exceeds 20% threshold",
    },
]


def run_eval(agent_fn, test_cases: list[dict]) -> dict:
    """Run the agent against each test case and score results."""
    results = {"passed": 0, "failed": 0, "details": []}
    false_positives = 0  # agent flagged a case that should be clean

    for case in test_cases:
        output = agent_fn(case["source"], case["injected_anomaly"])
        detection_correct = output["detected"] == case["expected_detection"]
        severity_correct = (
            output.get("severity") == case["expected_severity"]
            if case["expected_detection"]
            else True
        )
        if output["detected"] and not case["expected_detection"]:
            false_positives += 1
        passed = detection_correct and severity_correct
        results["passed" if passed else "failed"] += 1
        results["details"].append({
            "name": case["name"],
            "passed": passed,
            "detection_correct": detection_correct,
            "severity_correct": severity_correct,
        })

    total = len(test_cases)
    anomalous = sum(1 for c in test_cases if c["expected_detection"])
    clean = total - anomalous
    caught = sum(
        1 for c, d in zip(test_cases, results["details"])
        if c["expected_detection"] and d["detection_correct"]
    )
    results["pass_rate"] = results["passed"] / total
    # Detection rate: real anomalies the agent caught.
    results["detection_rate"] = caught / anomalous if anomalous else 1.0
    # False positive rate: clean cases the agent flagged anyway.
    results["false_positive_rate"] = false_positives / clean if clean else 0.0
    return results

Four metrics matter here:

| Metric | Target | Why |
| --- | --- | --- |
| Detection rate | > 95% | The agent must catch real anomalies. A missed critical failure is worse than a false alarm. |
| False positive rate | < 10% | Too many false alarms train teams to ignore alerts. This is the “alert fatigue” problem that kills monitoring systems. |
| Severity accuracy | > 90% | Calling a warning “critical” wastes on-call time. Calling a critical “warning” delays response. Severity matters. |
| Distribution drift detection | > 90% | This is the metric that separates the agent from a script. The fact table incident was a distribution drift. If the agent misses these, it fails at its core purpose. |

Run this eval suite on every change to the agent’s prompt, context config, or tool definitions. The eval is your regression safety net. It catches degradation before users do, which is exactly what the reconciliation queries failed to do for three months on that fact table.

Step 6: Add Guardrails (Article 8)

Article 8 defined three guardrail layers, escalation patterns for when the agent hits a boundary it cannot cross, and workflow gates for enforcing prerequisites. Here is how each layer applies.

Layer 1 (Input guardrails): Validate the config file before the agent reads it. Is the YAML well-formed? Do all referenced sources have connection strings? Are threshold values within sane ranges (no negative row counts, no freshness threshold of 0 hours)? A corrupted config is the simplest way to break the agent, and the easiest to prevent.
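
A sketch of that input guardrail, with invariants drawn from the paragraph above; extend it to match your full config schema:

```python
def validate_config(config: dict) -> list[str]:
    """Layer 1 input guardrail: reject a broken config before the agent
    reads it. Returns a list of human-readable errors (empty means OK)."""
    errors = []
    for source in config.get("sources", []):
        name = source.get("name", "<unnamed>")
        if not source.get("connection"):
            errors.append(f"{name}: missing connection string")
        checks = source.get("checks", {})
        freshness = checks.get("freshness", {})
        if "max_stale_hours" in freshness and freshness["max_stale_hours"] <= 0:
            errors.append(f"{name}: max_stale_hours must be positive")
        row_count = checks.get("row_count", {})
        if row_count.get("baseline_mean", 0) < 0:
            errors.append(f"{name}: negative baseline row count")
    return errors
```

Run this at the top of every daily run and abort (loudly) on any error, rather than letting the agent reason over garbage.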

Layer 2 (Reasoning guardrails): Validate tool results before they enter the context window. When the agent queries Snowflake for a row count, is the returned value a number? Is the timestamp parseable? Did the query actually succeed, or did it return an error that the connector wrapped in a valid-looking response? For distribution checks, are the returned frequencies non-negative and summing to 1.0? An invalid distribution entering the context window will produce nonsensical drift calculations that the LLM may still narrate confidently. This is the missing middle that most frameworks skip.
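
A sketch of that tool-result validation (the result-dict shape is an assumption):

```python
def validate_tool_result(result: dict) -> bool:
    """Layer 2 reasoning guardrail: sanity-check a tool result before it
    enters the context window."""
    count = result.get("row_count")
    if count is not None and (not isinstance(count, int) or count < 0):
        return False
    if "error" in result:  # connector wrapped a failure in a valid-looking response
        return False
    freqs = result.get("frequencies")
    if freqs is not None:
        if any(f < 0 for f in freqs):
            return False
        if sum(freqs) > 1.0 + 1e-6:  # a full distribution sums to 1.0;
            return False              # a top-N vector sums to less
    return True
```
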

Layer 3 (Output guardrails): Validate the daily brief before it reaches recipients. Does every finding cite a specific source and check result? Does every severity assignment reference a threshold from the config? Is the brief structured correctly (summary at top, details below, escalation actions clearly marked)?

Now apply the lethal trifecta check from Simon Willison:

| Trifecta leg | Present? | Mitigation |
| --- | --- | --- |
| Private data access | Yes. The agent reads from production data sources. | Restrict the agent to read-only access. Use a service account with SELECT permissions only. |
| Untrusted content | Possible. If data sources include user-generated content, tool results could contain injection patterns. | Run injection detection on tool results, not just user input. |
| External communication | Yes. The agent sends alerts via Slack and PagerDuty. | Restrict alert channels to predefined destinations. Do not let the agent compose arbitrary HTTP requests. Hard-code the webhook URLs. |

All three legs are present. The mitigations reduce the risk surface: read-only access prevents data modification, injection scanning on tool results catches prompt injection through data, and hard-coded alert channels prevent exfiltration.

Step 7: Plan for Improvement (Articles 10-11)

Article 10 covers when to move to multi-agent systems and the orchestration patterns that work. Article 11 defined the spectrum from static to self-improving agents.

Start at Level 2 (parameterized). Externalize thresholds, the source list, the check schedule, and the severity rules into the YAML config. The agent reads this config on every run. A human edits it when needs change. This is the right starting point. Do not build a self-improving Data Quality agent on day one.

Plan for Level 3 after 30 days. Once the agent has 30 days of daily runs, analyze which alerts were acted on and which were dismissed. If the team consistently dismisses row count warnings for the orders table on Saturdays, the weekend baseline is probably wrong. The agent can propose an updated weekend_mean based on the actual data. For distribution checks, 30 days of data reveals whether the baseline itself needs recalibrating: maybe the procedure_code distribution legitimately shifted after a new claims category was added, and the old baseline is now producing daily false positives. A human reviews and approves the proposed change.

Inner loop (daily execution): Read config, run checks, produce brief, log results. The inner loop does not change anything about the agent’s own behavior.

Outer loop (monthly threshold review): Read 30 days of alert history, identify patterns (consistent false positives, missed true positives), propose config updates. The human approves before any config change takes effect. This is the same inner/outer loop pattern from the briefing agent example in Article 8.
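
A sketch of one outer-loop proposal, for the dismissed weekend warnings described above. The log and field names are illustrative:

```python
import statistics

def propose_threshold_update(alert_log, observed_weekend_counts):
    """Outer-loop sketch: if weekend row-count warnings were consistently
    dismissed over the window, propose a recalibrated weekend baseline.
    Nothing is applied automatically; a human approves the change."""
    weekend_warns = [a for a in alert_log
                     if a["check"] == "row_count" and a["day_type"] == "weekend"]
    if len(weekend_warns) >= 4 and all(a["dismissed"] for a in weekend_warns):
        return {
            "field": "weekend_mean",
            "proposed": round(statistics.mean(observed_weekend_counts)),
            "reason": (f"{len(weekend_warns)} weekend row-count warnings "
                       "in the window, all dismissed"),
            "requires_approval": True,
        }
    return None
```

The output is a proposal object, not a config edit: the human-approval gate is what keeps this at Level 3 rather than full self-modification.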

Production Considerations

The seven steps get you to a working agent. Getting it to production requires four more decisions.

Deployment. A cron job or scheduled Lambda that triggers the agent daily. For AWS: a Lambda function triggered by an EventBridge schedule, reading the config from S3, writing the brief to S3, and posting alerts via SNS. Keep the infrastructure simple. The agent is the complex part; the deployment should not be.

Cost. Estimate tokens per run. The context config above is roughly 500 tokens. Historical baselines for 10 sources add another 2,000. Check results add 200-500 per source. A 10-source agent run consumes roughly 7,000-10,000 input tokens and 1,000-2,000 output tokens.

At current Claude Sonnet pricing, that is approximately $0.03-0.05 per daily run, or roughly $1/month. Cost is not a constraint for a daily agent. It becomes one if you run hourly or per-event.
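
The arithmetic is worth making explicit. A sketch, with the per-million-token prices as assumptions to be checked against current rates:

```python
def run_cost_usd(input_tokens, output_tokens,
                 usd_per_m_input=3.00, usd_per_m_output=15.00):
    """Back-of-envelope cost per run. The default prices are assumptions;
    verify against current model pricing before budgeting."""
    return (input_tokens / 1e6 * usd_per_m_input
            + output_tokens / 1e6 * usd_per_m_output)

# A mid-range daily run: roughly a few cents.
daily = run_cost_usd(8_000, 1_500)
```
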

Monitoring and Observability. Article 9 covers the five dimensions of agent observability. For the Data Quality agent, the most important are execution tracing (replay any daily run step by step to see why the agent flagged or missed an anomaly), token economics (track cost per run to ensure the agent stays within budget), and behavioral drift detection (compare this week’s alert patterns against the four-week baseline to catch silent degradation). Track three operational metrics: agent execution time (is it completing within the cron window?), alert volume per day (is alert fatigue building?), and false positive rate over trailing 30 days (is the config drifting?). These are meta-metrics: they measure the health of the monitoring system itself.

Rollback. Version the config in git. Every config change produces a commit. If a threshold update causes a flood of false positives, revert the commit and re-deploy. Rollback should take less than five minutes. If it takes longer, your deployment is too complex.

The Decision Was the Hard Part

This article walked through seven steps and roughly 2,500 words of reasoning to reach a system that, once decided, takes a weekend to build. The implementation itself, connecting to data sources, running checks, producing a brief, is straightforward engineering. Article 4 shows the mechanics.

The hard part was the decisions. Should this be an agent? (Step 2 forced us to articulate exactly why the fact table incident needed reasoning, not just thresholds.) What context does it need? (Step 4 prevented us from dumping every table definition into the window and hoping.) How will the prompts produce consistent output? (The prompt specification patterns from Article 5 turned vague severity labels into explicit, testable criteria.) How do you know if it works? (Step 5 gave us a regression suite that includes the exact class of failure, distribution drift from a mapping change, that went undetected for three months.) What can go wrong? (Step 6 mapped the attack surface before we exposed it.) How will you see what the agent actually does in production? (Article 9’s observability dimensions gave us execution tracing and drift detection before the first deployment.)

Gartner predicts that over 40% of agentic AI projects will be canceled by 2027. The canceled projects will not fail because models were not capable enough. They will fail because teams skipped the decision-making and jumped to coding.

These seven steps are not a framework to memorize. They are a checklist to run. Apply them to your next agent project, and you will spend more time thinking and less time debugging.

Do Next

| Tier | Action | Why it matters |
| --- | --- | --- |
| No experience | Read Articles 1 and 2 first. Come back here when you can name the four agent components without looking them up. | This article assumes the vocabulary. The earlier articles build it. |
| No experience | Identify one Data Quality problem you have encountered at work. Write down: what signal would have caught it early? Would that signal require reasoning, or just a threshold? | The answer to “reasoning or threshold?” determines whether the problem needs an agent or a script. That is Step 2. |
| Learning | Pick one data source at work. Write the YAML config entry for it: expected freshness, baseline row count, schema contract. Be honest about what you actually know vs. what you assume. | The gap between what you know and what you assume is where false positives live. The config forces precision. |
| Learning | Write three eval test cases for that data source: one normal result, one anomaly the agent should catch, one edge case (weekend, holiday, end-of-quarter). | Three test cases are not enough for production. They are enough to start. The practice of writing expected outcomes before running the agent changes how you think about agent quality. |
| Practitioner | Run the full seven-step process on your next agent project. Document each step’s decision, not just the output. Share the document with your team. | The decisions are more valuable than the code. A team that understands why the agent was designed this way can maintain and evolve it. A team that only has the code will rebuild it from scratch when requirements change. |
| Practitioner | After 30 days of running a Data Quality agent, analyze which alerts were acted on vs. dismissed. Propose threshold adjustments based on the data. | This is the transition from Level 2 to Level 3. The data tells you which thresholds are wrong. Your judgment tells you which adjustments are safe. |

This is Part 12 of 12 in The Practitioner’s Guide to AI Agents. ← Previous: The Self-Improving Agent · Start from the beginning with What Is an AI Agent? or find your entry point in the series guide.

Sources & References

  1. Monte Carlo Data: The State of Data Quality 2025 (2025)
  2. Great Expectations: Open Source Data Quality (2025)
  3. Soda: Data Quality Monitoring (2025)
  4. Anthropic: Building Effective Agents (2024)
  5. Vishnyakova: Context Engineering for AI Agents (2026)
  6. Simon Willison: The Lethal Trifecta for AI Agents (2025)
  7. Karpathy AutoResearch (GitHub) (2026)
  8. Gartner: Over 40% of Agentic AI Projects Will Be Canceled by 2027 (2025)
  9. OpenAI: A Practical Guide to Building Agents (2025)
  10. AgentDrift: Tool-Output Contamination in AI Agents (2026)
