Your AI Agent Has a Data Quality Problem and No One Is Checking
AI agents trust every tool response they receive, with no standardized quality controls between tool-calling outputs and LLM reasoning. This article maps the six traditional Data Quality dimensions onto the context window, exposing the most consequential unmonitored data pipeline in enterprise AI.
I Googled Myself. The AI Got It Wrong. Then It Argued.
A few weeks ago, I tried something simple. I opened Google AI Mode and asked it about my own blog, vikaspratapsingh.com. It returned a description that was partially correct: yes, it covers Data Governance and AI. But the rest was wrong. It described articles I have never written and topics I have never covered.
So I asked the follow-up question any reasonable person would: “Who is the author?” Google AI Mode returned a confident biography of someone who is not me. Different name. Different career history. Different city. I corrected it. It came back with a new wrong answer. I corrected it again. A third wrong answer, delivered with the same confidence as the first.
This was not a hallucination in the traditional sense. The LLM was not generating facts from its parametric memory. It was calling a tool (Google’s own search index), receiving incorrect or incomplete data, and then reasoning faithfully within that corrupted context. Each time I provided a correction, the agent weighed my input against the tool’s indexed data and chose to trust the tool. The corrections never stuck because the source kept winning.
This small, personal experience captures something much larger. The context window is becoming a critical data pipeline in enterprise AI, yet in most agent stacks there is still no explicit, measurable quality-control layer between tool outputs and model reasoning.
The Context Window Is the New Data Pipeline
Andrej Karpathy called the context window “the LLM’s RAM” in June 2025. The analogy is useful but incomplete. RAM is passive storage. The context window is an active data pipeline: information flows in from tool calls, retrieval systems, user inputs, and prior conversation turns, and the LLM reasons over all of it to decide what to do next. Every action the agent takes depends on the quality of what is inside that window.
We spent two decades building Data Quality into enterprise data pipelines. ETL jobs have validation rules. Data warehouses have constraint checks. Data Observability platforms monitor freshness, volume, and schema drift. The DAMA-DMBOK defines six core dimensions of Data Quality that any data practitioner can recite from memory: Accuracy, Completeness, Timeliness, Consistency, Validity, and Uniqueness.
None of these protections have been standardized for the context window. In most common agent architectures, when a tool returns a response, that response enters the context and is treated as trusted input by default. There is no standard validation, no quality check, no anomaly detection between the tool result and the LLM’s reasoning.
As VentureBeat put it in a recent analysis: “AI agents trust the context given to them implicitly” and “you cannot let an agent drink from a polluted lake.” The metaphor is apt, but the industry has not yet built the water treatment plant.
The Evidence Is Already In
This is not a theoretical risk. The failures are public, documented, and expensive.
Search agents return fabricated information at scale. Google’s AI Overviews, rolled out to over a billion users, recommended putting glue on pizza (sourced from a joke Reddit post), told users to eat rocks for minerals (sourced from an Onion article), and generated medical misinformation that a Guardian investigation in January 2026 documented in detail. A Mount Sinai study confirmed that AI chatbots “present false medical details with confidence.” The tool (web search) returned data. The agent trusted it. The output was wrong.
Customer service agents create legal liability. In 2024, Air Canada’s chatbot confidently told a grieving passenger he could book a full-fare ticket and apply for a bereavement discount retroactively. That policy did not exist. A tribunal held Air Canada legally liable for the chatbot’s promise. The agent’s context included incorrect policy information, and the LLM did what LLMs do: it reasoned within the context it was given and produced a clear, confident, wrong answer.
Even OpenAI cannot trust raw tool results. When OpenAI built their internal data analysis agent, they discovered it required layered context grounding, systematic evaluations, and runtime inspection before it produced reliable outputs. If the organization that builds GPT cannot trust raw tool responses fed into a context window without multiple layers of validation, no one should assume their agent can either.
Research confirms agents never question tool data. The AgentDrift paper, an arXiv preprint published in March 2026, tested what happens when tool outputs fed into the context window contain contaminated data. Across 1,563 contaminated tool-output turns and 7 different LLMs, no agent ever questioned the reliability of the data it received from tools. Safety violations appeared in 65-93% of turns, but standard quality metrics (task completion rates, response coherence) stayed stable. The agents looked like they were working fine. They were not. The study focused on high-stakes recommendation settings, but the mechanism it reveals (agents accepting contaminated tool data without questioning it) applies across agent architectures.
That last finding deserves emphasis. Standard evaluation metrics do not catch this failure mode. An agent can complete its assigned task, produce coherent output, and score well on benchmarks while operating on corrupted data the entire time. The metrics measure execution quality, not input quality.
The Compound Error Problem
There is a mathematical dimension to this that most agent builders have not internalized.
Assume each step in an agentic workflow has an 85% chance of producing a correct result. That sounds reasonable: not perfect, but good enough for a single operation. Now chain ten steps together, which is a typical agentic workflow for tasks like research, analysis, or code generation.
The probability of the entire chain succeeding is 0.85^10 ≈ 0.20.
An 85%-accurate agent fails 80% of the time on a 10-step task. This calculation assumes independence between steps and serves as an illustrative model, not observed field data, but the directional point holds: compound error accumulation is brutal even under generous assumptions.
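The compound-error model above is easy to sanity-check in a few lines. This sketch assumes, as the text does, that steps are independent with a uniform per-step accuracy:

```python
def chain_success_probability(p: float, n: int) -> float:
    """Probability an n-step chain succeeds, assuming each step is
    independently correct with probability p (an illustrative model,
    not observed field data)."""
    return p ** n

# Per-step accuracy of 85% collapses quickly as chains grow.
for n in (1, 5, 10, 20):
    print(f"{n:2d} steps -> {chain_success_probability(0.85, n):.2%}")
```

Running this shows the 10-step chain succeeding only about a fifth of the time, which is the 80% failure rate quoted above.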
This is not a surprising result to anyone who has worked in data pipelines. It is the same reason we build validation at every stage of an ETL process rather than only checking the final output. But agent architectures do not have stage-level validation. The context window accumulates tool responses across steps, and each subsequent step reasons over the entire accumulated context, errors included.
The compound AI systems research has documented this pattern extensively. Every additional component in a compound system multiplies the failure surface. The difference is that in traditional systems, we built monitoring and validation between components. In agent systems, we have not.
This math explains a broader pattern. Gartner predicts 60% of AI projects will be abandoned through 2026 due to data readiness failures. They separately predict that over 40% of agentic AI projects will be canceled by 2027. MIT Sloan Management Review has argued that the majority of enterprise AI investments produce no measurable return. These are not unrelated statistics. They are symptoms of the same root cause: organizations deploy AI systems on top of data infrastructure that cannot support them.
The Six Data Quality Dimensions, Mapped to the Context Window
The framework for fixing this already exists. We just have not applied it to the right pipeline.
The six dimensions of Data Quality defined in the DAMA-DMBOK have been the standard for assessing data fitness for two decades. Every enterprise data team knows them. Here is what each dimension looks like when applied to the context window instead of a database or warehouse.
1. Accuracy: Is the tool-returned data factually correct?
In a data warehouse, Accuracy means the stored value matches the real-world entity it represents. In the context window, it means the same thing: does the tool response reflect reality?
My Google AI Mode experience is a textbook Accuracy failure. The search tool returned data about my blog. The data was wrong. The agent reasoned over it as if it were correct.
What this looks like in practice: A coding agent calls a documentation API and receives an outdated function signature. A research agent retrieves a Wikipedia summary that has been vandalized. A customer service agent pulls from a knowledge base that contains a policy that was updated last quarter but never refreshed in the retrieval index.
What a quality check would do: Cross-reference tool outputs against a second source. Flag responses with low-confidence indicators. Implement a “trust but verify” pattern where critical data points are validated before the agent acts on them.
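A minimal "trust but verify" check might compare a critical value from the primary tool against an independent second source and only accept it when they agree within tolerance. The function name and tolerance here are illustrative, not a standard API:

```python
def verify_against_second_source(primary: float, secondary: float,
                                 rel_tolerance: float = 0.01) -> bool:
    """Return True if two independently sourced values agree within
    rel_tolerance (relative difference). Hypothetical helper, shown
    only to sketch the cross-reference pattern."""
    if primary == secondary:
        return True
    denom = max(abs(primary), abs(secondary))
    return abs(primary - secondary) / denom <= rel_tolerance
```

In practice the agent would act on agreement, and flag or escalate on disagreement rather than silently picking one source.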
2. Completeness: Did the tool return all relevant information?
In a warehouse, Completeness means no required fields are NULL and no records are missing. In the context window, it means the tool returned enough information for the agent to reason correctly.
What this looks like in practice: A retrieval-augmented generation (RAG) system returns the top 3 chunks from a 200-page document, and the answer depends on chunk 47. An API returns paginated results but the agent only reads page 1. A search tool truncates results at 500 tokens, cutting off the most relevant paragraph.
What a quality check would do: Detect when tool responses hit length or pagination limits. Assess whether the returned context is sufficient for the stated task. Trigger follow-up queries when coverage appears incomplete.
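A completeness gate can be as simple as a few heuristics over the raw response before it enters the context. The field names below (`text`, `page`, `total_pages`) are assumptions for illustration, not any real tool's schema:

```python
def completeness_flags(response: dict, max_chars: int = 500) -> list:
    """Return a list of completeness warnings for a tool response.
    Flags responses that hit a length ceiling (possible truncation)
    or report more pages than were actually fetched."""
    flags = []
    text = response.get("text", "")
    if len(text) >= max_chars:
        flags.append("possible_truncation")
    if response.get("page", 1) < response.get("total_pages", 1):
        flags.append("unfetched_pages")
    return flags
```

An agent seeing `unfetched_pages` could trigger a follow-up query for the remaining pages instead of reasoning over page 1 alone.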
3. Timeliness: Is the data current?
In a warehouse, Timeliness means the data reflects the most recent state of the source system. In the context window, it means the tool response is not stale.
What this looks like in practice: A financial agent queries a market data API that returns prices delayed by 15 minutes. A legal research agent retrieves case law that does not include a recent ruling that changed the precedent. A compliance agent pulls from a regulatory database last updated before a new rule took effect.
What a quality check would do: Check timestamps on tool responses. Flag data older than a configurable threshold. Distinguish between data that is intentionally historical and data that is unintentionally stale.
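A staleness check only needs a timestamp on the response and a configurable threshold. The `as_of` field is a hypothetical name; the point is that the threshold is explicit rather than implicit:

```python
from datetime import datetime, timedelta, timezone

def is_stale(as_of: datetime, max_age: timedelta, now=None) -> bool:
    """Flag a tool response whose timestamp is older than max_age.
    Intentionally historical data should bypass this check entirely;
    this only catches *unintentionally* stale responses."""
    now = now or datetime.now(timezone.utc)
    return (now - as_of) > max_age
```

The threshold should vary by task: minutes for market data, days for documentation, and no check at all for deliberately historical queries.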
4. Consistency: Do multiple tool results contradict each other?
In a warehouse, Consistency means the same entity has the same value across tables. In the context window, it means different tool responses do not contradict each other.
What this looks like in practice: An agent queries two APIs about the same company’s revenue and gets different numbers. A research agent retrieves three articles that make contradictory claims about a scientific finding. A planning agent receives schedule data from two systems with conflicting time zones.
What a quality check would do: Compare overlapping claims across tool responses. Flag contradictions before the agent proceeds. Implement conflict resolution rules: which source wins when they disagree?
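A consistency check can surface contradictions and apply an explicit "which source wins" rule at the same time. The reliability tiers below are an example policy (they mirror the classification suggested later in this article), not a standard:

```python
TIER_RANK = {"verified": 0, "semi_trusted": 1, "unverified": 2}

def resolve_claims(claims: list) -> tuple:
    """claims: list of {"value": ..., "tier": ...} about the same fact.
    Returns (winning value, conflict_detected). The winner is the claim
    from the most reliable tier; the conflict flag tells the agent the
    sources disagreed, so it can hedge or escalate."""
    values = {repr(c["value"]) for c in claims}
    conflict = len(values) > 1
    winner = min(claims, key=lambda c: TIER_RANK[c["tier"]])
    return winner["value"], conflict
```

Crucially, the conflict flag survives resolution: the agent should know it is acting on contested data even after a winner is picked.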
5. Validity: Does the data conform to expected formats and domains?
In a warehouse, Validity means values fall within acceptable ranges and formats. In the context window, it means the tool response is structurally sound and semantically appropriate for the task.
What this looks like in practice: A data analysis agent receives a JSON response with unexpected schema changes. A code-generation agent receives API documentation in a format it was not designed to parse. A conversational agent receives HTML markup when it expected plain text.
What a quality check would do: Validate response schemas against expected formats. Check that returned values fall within plausible ranges. Reject responses that cannot be parsed into the expected structure.
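A validity gate checks structure and plausibility before the response is admitted. The fields (`price`, `symbol`) and range below are invented for illustration; real systems would typically drive this from a declared schema:

```python
def validate_response(resp: dict) -> list:
    """Return a list of validity errors for a (hypothetical) market-data
    response: required fields present, types correct, values in a
    plausible range. An empty list means the response passes."""
    errors = []
    price = resp.get("price")
    if not isinstance(price, (int, float)):
        errors.append("price: missing or not numeric")
    elif not (0 < price < 1_000_000):
        errors.append("price: outside plausible range")
    if not isinstance(resp.get("symbol"), str):
        errors.append("symbol: missing or not a string")
    return errors
```

Responses that fail should be rejected or retried, never silently appended to the context.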
6. Uniqueness: Are there duplicate or redundant entries?
In a warehouse, Uniqueness means no duplicate records inflate counts or skew analysis. In the context window, it means redundant information does not waste token budget or bias the agent’s reasoning through repetition.
What this looks like in practice: A RAG system retrieves five chunks that contain near-identical text from different pages of the same document. A multi-tool agent calls the same API twice with slightly different parameters and gets back overlapping results. Duplicate context fragments cause the agent to over-weight certain information simply because it appears more frequently.
What a quality check would do: Deduplicate tool responses before they enter the context. Detect semantic overlap across retrieved chunks. Manage the token budget by compressing redundant information.
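Near-duplicate retrieval chunks can be filtered with even a crude token-overlap measure before they consume token budget. The Jaccard threshold here is arbitrary; production systems more often compare embeddings, but the shape of the check is the same:

```python
def dedupe_chunks(chunks: list, threshold: float = 0.8) -> list:
    """Drop retrieved chunks whose token-set Jaccard overlap with an
    already-kept chunk meets the threshold. A deliberately simple
    sketch; embedding similarity is the more common real-world choice."""
    kept = []
    for chunk in chunks:
        tokens = set(chunk.lower().split())
        is_duplicate = any(
            len(tokens & set(k.lower().split()))
            / max(1, len(tokens | set(k.lower().split()))) >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(chunk)
    return kept
```

Beyond saving tokens, deduplication removes the repetition bias the text describes: information should not be weighted more heavily just because the retriever returned it five times.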
Why Existing Guardrails Miss This Entirely
The AI safety ecosystem has invested heavily in two things: input safety (prompt injection detection, content filtering on user messages) and output quality (hallucination detection, toxicity filters, response grounding). These are important. They are also insufficient.
The gap is between tool responses and LLM reasoning. Consider a typical agent architecture:
- User sends a prompt (input filters check this)
- Agent decides to call a tool
- Tool returns data into the context window (no standard quality gate exists here)
- Agent reasons over the full context
- Agent produces an output (output filters check this)
Step 3 is the blind spot. The tool response enters the context with the same status as verified ground truth. The agent has no mechanism to question it, weight it by reliability, or flag it for validation. The AgentDrift researchers confirmed this: across all 7 LLMs tested, not a single agent questioned tool-data reliability, even when the data was overtly contaminated.
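The missing quality gate at step 3 can be sketched as a wrapper that runs checks on every tool response before admitting it to the context. Everything here is hypothetical scaffolding (the exception class, the check tuples), shown only to make the architectural point concrete:

```python
class ContextQualityError(Exception):
    """Raised when a tool response fails its quality checks."""

def gated_tool_call(tool, *args, checks=(), context=None):
    """Call a tool, run each (name, check_fn) pair on its response, and
    only append the response to the context if every check passes.
    This is the gate standing between step 3 and step 4 above."""
    response = tool(*args)
    failures = [name for name, check in checks if not check(response)]
    if failures:
        raise ContextQualityError(f"tool response failed checks: {failures}")
    if context is not None:
        context.append(response)
    return response
```

In a real system the checks would be the six-dimension validators this article describes, and failure would route to retry, a second source, or human escalation rather than a bare exception.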
This is the equivalent of building a Data Quality program that monitors the dashboards but never checks the source systems. You catch formatting errors in the reports while the underlying data silently corrupts every decision downstream.
Organizations building agentic AI systems need to treat the context window as a data pipeline and apply the same discipline to it that we have spent twenty years perfecting for warehouses, lakes, and streaming systems. The AI Governance frameworks that enterprises are adopting need to extend their scope to cover this pipeline. The Data Governance foundations that organizations have built need to apply inside the agent, not just outside it.
What to Do Next
Whether you are building agents or deploying third-party agentic tools, these actions fall within your direct control:
| Priority | Action | Why it matters |
|---|---|---|
| This week | Audit your agent’s tool calls and log every response entering the context window | You cannot improve what you cannot see; most teams have little to no visibility into what their agents consume |
| This week | Classify each tool source by reliability tier (verified, semi-trusted, unverified) | Not all sources deserve equal trust; the agent should know the difference |
| This month | Implement schema validation on tool responses before they enter the context | Invalid or malformed data is the easiest failure mode to catch and the most common one to miss |
| This month | Add cross-reference checks for critical data points (query a second source for high-stakes decisions) | Single-source trust is the root cause of the Air Canada failure and most agent errors |
| This quarter | Build a context quality dashboard with the 6 DQ dimensions as metrics | Treat the context window like a pipeline: monitor freshness, completeness, accuracy, consistency, validity, and uniqueness |
| This quarter | Establish escalation rules: when context quality falls below threshold, stop the agent and involve a human | Automated systems need circuit breakers; agents without them will confidently execute on bad data |
What Comes Next
This article establishes the problem: the context window is an unmonitored data pipeline, and the traditional Data Quality dimensions give us a precise vocabulary for what is missing. But identifying the gap is only the first step.
The next article in this series explores the architectural gap in detail: where exactly in the agent stack should context quality controls live, what patterns are emerging from teams that have started building them, and what a reference architecture for context-quality-aware agents looks like. The framework matters, but the engineering is where it becomes real.
This article is related to The Practitioner’s Guide to AI Agents, a nine-part series on building, evaluating, and improving AI agents.
Sources & References
- AgentDrift: Probing Agent Influence on LLM Safety and Quality (2026)
- OpenAI: Building an AI Data Agent (2025)
- Air Canada Chatbot Ruling: Airline Held Liable (2024)
- Google AI Overviews: Errors and Misinformation (2026)
- Mount Sinai Study: AI Chatbots Present False Medical Details (2025)
- Gartner: 40%+ of Agentic AI Projects Will Be Canceled by 2027 (2025)
- Gartner: 60% of AI Projects Abandoned Due to Data Readiness (2024)
- MIT Sloan: Why So Many AI Pilots Fail (2024)
- Andrej Karpathy: Context Window as LLM RAM (2025)
- VentureBeat: Your AI Agent Needs a Data Constitution (2026)
- Compound AI Systems and the DSPy Framework (2024)
- DAMA-DMBOK: Data Management Body of Knowledge (2017)