Context Is the Program: Why Data Quality Inside the Agent Matters More Than the Model
Pike's Rule 5 says data dominates. In AI agents, the context window IS the data structure. This article traces why context quality determines agent behavior more than model capability, maps the five criteria that define good context, and shows what happens when stale data enters the reasoning loop unchecked.
Part 6 of 12: The Practitioner’s Guide to AI Agents
“Data Dominates”
Pike’s Rule 5 says data dominates. In Article 3, I showed why this translates to “context dominates” for agents. This article goes deeper into what that means in practice.
I learned this the hard way on a Data Governance engagement at Capgemini. A client needed to classify their Critical Data Elements. The expected count was somewhere between 300 and 400: the elements that drive regulatory reporting, risk calculations, and key business decisions. Reasonable for an organization of that size.
The classification logic had a flaw. Instead of 300 to 400 CDEs, the system tagged over 2,000 data elements as critical. Nobody caught it during validation because the output looked structurally correct: every element had a classification, every classification had a rationale, every rationale cited a business process. The data was valid. It was also wrong.
The error surfaced during an audit. Regulators asked for evidence that each critical element had documented controls, lineage, and quality monitoring. Producing that evidence for 2,000 elements instead of 400 created weeks of rework and pushed the program timeline. The root cause was not a model failure or a tooling gap. It was bad data entering a classification process that nobody validated before trusting the output. The same pattern that breaks agents.
In agents, the context window is the data structure. It holds everything the model can see when it makes a decision: the user’s request, the conversation history, tool results, and any retrieved documents. The model is the algorithm.
If you fill the context window with the right information, organized well, the model’s reasoning follows naturally. If you fill it with stale, partial, or contradictory data, no model capability can compensate. The model will reason faithfully over whatever it receives and produce a confident, well-structured, wrong answer.
Rule 5 gets its own article because it connects to the most consequential architectural gap in how agents are built today. The evidence is now strong enough to state it plainly: invest your time in context quality, not model upgrades.
The Context Window as a Data Pipeline
I mapped the six Data Quality dimensions to the context window in an earlier article and identified the architectural gap at Boundary 3 in a follow-up. The key finding: tool results enter the context with no semantic validation, and standard quality metrics do not catch the resulting failures.
The two-path diagram captures the core argument. Both paths use the same model. Both produce output with the same confidence level. The only difference is whether the context was validated before the model reasoned over it. The model cannot tell good context from bad. That is not a model limitation; it is a design property. The quality of what it receives is your responsibility.
Those articles establish the problem. This one is about the solution: five engineering criteria that define what “good context” means, with enough specificity to build against.
Context Engineering as a Discipline
The term “Context Engineering” has solidified into a formal discipline in early 2026. Andrej Karpathy described the shift: the core skill in agent development is not prompt engineering but Context Engineering. The distinction matters. Prompt engineering optimizes the wording of a query. Context Engineering designs the entire informational environment in which the agent makes decisions.
Vishnyakova (2026) formalized five criteria that define what “good context” means. These are the engineering specification for Context Quality: not aspirational, but the minimum properties context must satisfy before an agent acts on it.
1. Relevance: Only What the Agent Needs Right Now
The agent receives only the information needed for the current decision. Excessive context is not harmless; it actively degrades performance (see the Chroma findings below). A coding agent fixing a bug in one file does not need the entire repository in its window.
What this looks like in practice. A customer support agent receives the user’s full ticket history across 200 interactions spanning three years. The current question is about a billing error from last week. The agent has 199 irrelevant tickets polluting its reasoning, and the model gives equal weight to a complaint from 2023 and the actual issue.
How to build the check. Before injecting context, score each item against the current query using embedding similarity or keyword overlap. Set a relevance threshold (start with cosine similarity > 0.7) and drop items below it. Log what gets dropped so you can tune the threshold. The LangChain Context Engineering guide calls this the “select” operation.
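As a minimal sketch of the "select" operation, the filter below scores items against the query using keyword overlap (Jaccard similarity) as a dependency-free stand-in for embedding similarity. The function names and the 0.2 threshold are mine, chosen for word overlap rather than the 0.7 cosine figure above; swap in your embedding model's similarity score in production.

```python
def keyword_overlap(query: str, item: str) -> float:
    """Jaccard overlap between word sets -- a cheap stand-in for embedding similarity."""
    q, i = set(query.lower().split()), set(item.lower().split())
    return len(q & i) / len(q | i) if q | i else 0.0

def filter_context(query: str, items: list[str],
                   threshold: float = 0.2) -> tuple[list[str], list[str]]:
    """Keep items above the relevance threshold; return dropped items for logging."""
    kept, dropped = [], []
    for item in items:
        (kept if keyword_overlap(query, item) >= threshold else dropped).append(item)
    return kept, dropped
```

Logging the `dropped` list is what makes the threshold tunable: if relevant items are being filtered out, you lower it; if noise is getting through, you raise it.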
2. Sufficiency: Everything Needed, Nothing Hallucinated
The context contains everything the agent needs without forcing it to fill gaps through hallucination. When data is missing, the agent invents plausible assertions. Sufficiency is a guard against hallucination at the architectural level.
What this looks like in practice. A research agent is asked to compare Q1 revenue across three competitors. Its search tool returns data for two of the three. Instead of reporting the gap, the agent estimates the third company’s revenue based on “industry averages” it fabricated. The output reads as a complete comparison. It is not.
How to build the check. Define required fields per task type. Before the agent reasons, verify each required field has a populated value from a tool result. If any required field is missing, the agent should report the gap rather than fill it. A simple implementation: maintain a checklist of expected data points and flag any that remain empty after all tool calls complete.
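A minimal version of that checklist, with field names invented for the competitor-revenue example above:

```python
def check_sufficiency(required_fields: list[str], tool_results: dict) -> dict:
    """Verify every required field is populated before the agent reasons."""
    missing = [f for f in required_fields if tool_results.get(f) in (None, "", [])]
    if missing:
        return {
            "sufficient": False,
            "missing": missing,
            "instruction": "Report these gaps to the user. Do not estimate or fill them.",
        }
    return {"sufficient": True, "missing": []}

# Two of three competitors returned data; the third did not
gaps = check_sufficiency(
    ["acme_q1_revenue", "globex_q1_revenue", "initech_q1_revenue"],
    {"acme_q1_revenue": 12.0, "globex_q1_revenue": 9.8},
)
# gaps["missing"] names the field the agent must report rather than fabricate
```

The `instruction` field matters: it gives the agent an explicit alternative to hallucination when data is missing.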
3. Isolation: Each Agent Sees Only Its Own Data
In multi-agent systems, each sub-agent sees only its own context. Data leakage between roles creates both controllability failures and security vulnerabilities. An agent that sees everything from its peers makes decisions based on data it was not designed to interpret.
What this looks like in practice. A code review agent and a deployment agent share a context window. The deployment agent sees internal code comments containing hardcoded test credentials from the review context. It includes them in a deployment log that gets shipped to an external monitoring service.
How to build the check. Give each agent its own context scope. Pass only structured outputs (not raw context) between agents. Implement a “context firewall” that strips fields tagged as internal before any cross-agent handoff. If you are using a multi-agent framework, verify that agent A cannot read agent B’s system prompt or raw tool results.
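A context firewall can be as simple as a filtered handoff. The sketch below assumes each agent emits a structured dict and that you maintain a list of field names tagged internal; the names are illustrative, not from any particular framework.

```python
def firewall_handoff(output: dict, internal_fields: set[str]) -> dict:
    """Pass only non-internal structured fields across an agent boundary.

    Raw context (prompts, raw tool results) never crosses; only the
    whitelisted structured output of the producing agent does.
    """
    return {k: v for k, v in output.items() if k not in internal_fields}

# The code review agent's output contains fields the deployment agent must not see
review_output = {
    "verdict": "approve",
    "files_changed": 3,
    "raw_diff": "- password = 'hunter2'",  # internal: may contain credentials
    "reviewer_notes": "ask infra about the test creds",  # internal
}
handoff = firewall_handoff(review_output, {"raw_diff", "reviewer_notes"})
# The deployment agent receives only the verdict and the file count
```

An allowlist (fields explicitly permitted to cross) is stricter than this denylist and is the safer default when new fields get added to agent outputs over time.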
4. Economy: Every Token Has a Cost
Every token costs money, time, and latency. Manus reported a 10x cost differential between cached and uncached tokens, with a 100:1 input-to-output ratio. Context architecture decisions dominate unit economics.
What this looks like in practice. An agent retrieves a 15-page legal document for every query, even when only one paragraph is relevant. At 200K tokens per session and $15/M input tokens, each session costs $3. With caching and compression, the same session costs $0.30. Multiply by thousands of daily sessions and the difference is the project’s budget.
How to build the check. Track token counts per tool result. Set a per-result budget (e.g., 2,000 tokens for a search snippet, 10,000 for a document retrieval). Compress results that exceed the budget using extractive summarization or chunking with relevance filtering. Monitor your cached-to-uncached token ratio weekly. If it drops below 60%, your context architecture needs restructuring.
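A per-result budget check can be sketched in a few lines. The 4-characters-per-token estimate is a rough English-text heuristic (use your model's tokenizer for real counts), and truncation here stands in for the extractive summarization the text describes; budget figures come from the examples above.

```python
# Per-result token budgets from the text above (illustrative)
BUDGETS = {"search_snippet": 2_000, "document_retrieval": 10_000}

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def enforce_budget(result_text: str, result_type: str) -> dict:
    """Compress any tool result that exceeds its per-type token budget."""
    budget = BUDGETS.get(result_type, 2_000)
    tokens = estimate_tokens(result_text)
    if tokens <= budget:
        return {"text": result_text, "tokens": tokens, "compressed": False}
    # Over budget: truncation stands in for extractive summarization here
    return {"text": result_text[: budget * 4], "tokens": budget, "compressed": True}
```

Running every tool result through a gate like this is also what makes the weekly cached-to-uncached ratio measurable: you know exactly how many tokens each source contributes.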
5. Provenance: Traceable, Auditable, Trustworthy
Every element of context must be traceable to which system produced it, when, and with what trust level. Without provenance, you cannot audit agent decisions, debug errors, or satisfy regulatory requirements.
What this looks like in practice. An agent produces a financial summary citing “company revenue of $12M.” When the number turns out to be wrong, the team cannot determine whether the error came from the CRM, the finance database, or a stale cached result. They rebuild the entire pipeline to find the bug. It takes three days.
How to build the check. Attach metadata to every tool result before it enters the context:
`source_name`, `retrieved_at` (ISO timestamp), `ttl_seconds`, and `trust_tier` (verified / semi-trusted / unverified). Store this metadata alongside the result in a structured envelope. When debugging, filter the context log by source and timestamp to isolate the faulty input in minutes, not days.
Code example: provenance envelope and contradiction detection
```python
from datetime import datetime

def wrap_with_provenance(tool_result: dict, source: str, ttl_seconds: int = 3600,
                         trust_tier: str = "semi-trusted") -> dict:
    """Attach provenance metadata to a tool result."""
    return {
        "data": tool_result,
        "provenance": {
            "source": source,
            "retrieved_at": datetime.utcnow().isoformat() + "Z",
            "ttl_seconds": ttl_seconds,
            "trust_tier": trust_tier,
        },
    }

def detect_contradiction(result_a: dict, result_b: dict, entity_key: str) -> dict:
    """Flag when two tool results disagree about the same entity."""
    value_a = result_a.get("data", result_a).get(entity_key)
    value_b = result_b.get("data", result_b).get(entity_key)
    if value_a is None or value_b is None:
        return {"status": "incomplete", "reason": f"'{entity_key}' missing from one source"}
    if value_a != value_b:
        return {
            "status": "contradiction",
            "entity_key": entity_key,
            "source_a": {"value": value_a,
                         "source": result_a.get("provenance", {}).get("source", "unknown")},
            "source_b": {"value": value_b,
                         "source": result_b.get("provenance", {}).get("source", "unknown")},
            "action": "Do not proceed. Resolve before injecting into context.",
        }
    return {"status": "consistent", "entity_key": entity_key, "value": value_a}

# Two tools return different revenue figures for the same company
crm = wrap_with_provenance(
    {"company": "Acme Corp", "annual_revenue": 12_000_000},
    source="CRM", trust_tier="semi-trusted",
)
finance = wrap_with_provenance(
    {"company": "Acme Corp", "annual_revenue": 9_800_000},
    source="Finance DB", trust_tier="verified",
)
check = detect_contradiction(crm, finance, "annual_revenue")
# check["status"] == "contradiction"
# Provenance tells you Finance DB (verified) disagrees with CRM (semi-trusted)
# Resolution: prefer the verified source, or escalate for human review
```
These five criteria are not a wishlist. They are the engineering specification for what “quality context” means. Every tool result that enters the context window should be evaluated against them. The checks do not require a new framework or a research breakthrough. They require treating the context window with the same discipline we apply to every other data pipeline in enterprise systems.
Context Placement Tactics
Knowing what should go into the context window (the five criteria) is half the problem. Knowing where to place it within the window is the other half. Position affects attention. Models do not weight all parts of the context equally, and getting placement wrong causes silent failures even when the context itself is high quality.
The Case Facts Block
In any task involving specific data points (customer IDs, dates, financial figures, legal terms), those facts must be stated once, clearly, at the top of the context, before any analysis or instructions. This is the case facts block.
```
CASE FACTS (do not modify or summarize these values):
- Customer ID: CUST-29847
- Invoice amount: $14,290.00
- Invoice date: 2026-03-15
- Payment due: 2026-04-14
- Account status: Active, Tier 2
```
The block serves two purposes. First, it gives the model a single authoritative source for critical data. When the same customer ID appears in three different tool results with slight formatting differences (CUST-29847, cust_29847, 29847), the case facts block establishes which representation is canonical. Second, the explicit instruction “do not modify or summarize” prevents the model from paraphrasing numbers during internal reasoning. Models occasionally round, truncate, or reformat values when processing long contexts. Pinning facts at the top with a preservation instruction reduces this.
The Lost-in-the-Middle Effect
Research on long-context models shows a consistent pattern: models pay the most attention to information near the beginning and end of the context window, with a noticeable dip in recall for information in the middle. This U-shaped attention curve means that identical information placed at position 1,000 vs. position 50,000 in a 100K-token context may produce different results.
The practical implication for agent builders:
- Beginning of context: Place the case facts block, critical instructions, and the user’s original request. These are the elements the model must not lose track of across a long reasoning chain.
- End of context: Place the most recent tool results and the current step’s instructions. The model’s attention naturally focuses here as it decides its next action.
- Middle of context: Place reference material, historical tool results, and verbose documentation. This is where the model’s recall is weakest, so put information here that is useful for reference but not critical for the current decision.
This is not about dumping less-important data in the middle. It is about recognizing that the model’s attention architecture has a known weakness and designing around it. If your agent consistently misses information from early tool calls in a multi-step task, the information has likely drifted to the middle of a growing context. Moving it to a pinned section at the top can fix the problem without changing the model.
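The placement rules above can be collapsed into a single assembly function. This is a minimal sketch with section names of my own choosing; the point is that context ordering is an explicit, testable decision rather than an accident of append order.

```python
def assemble_context(case_facts: str, instructions: str, user_request: str,
                     reference_material: list[str], recent_results: list[str],
                     current_step: str) -> str:
    """Order context sections around the U-shaped attention curve:
    critical material at the start and end, reference-only material in the middle."""
    parts = [
        case_facts,           # beginning: pinned facts the model must not lose
        instructions,         # beginning: critical instructions
        user_request,         # beginning: the original request
        *reference_material,  # middle: weakest recall -- reference-only content
        *recent_results,      # end: latest tool results
        current_step,         # end: what to do next
    ]
    return "\n\n".join(p for p in parts if p)
```

Because assembly is centralized, fixing a "lost in the middle" failure becomes a one-line change: move the drifting item from `reference_material` into the pinned `case_facts` section.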
Progressive Summarization Danger
As conversations grow, a common optimization is to summarize older tool results to save tokens. Summarize the first five search results so the model has room for the sixth. Compress yesterday’s findings into a paragraph. This works for narrative content. It is dangerous for data with precision requirements.
When a model summarizes “$14,290.00 invoice due 2026-04-14 for CUST-29847,” it might produce “approximately $14K invoice due mid-April.” Each transformation is small. Over multiple summarization rounds, the agent’s context drifts from the original facts. A date shifts. An amount rounds. An ID disappears. The agent continues reasoning as if its context is accurate, because it has no way to know that the summary lost precision.
The rule: never summarize values that downstream decisions depend on. Summarize narrative and explanation freely. Preserve numbers, dates, identifiers, and quoted text verbatim. If token pressure forces you to compress, extract the critical values into the case facts block before summarizing the surrounding text.
```python
def compress_tool_result(result: dict, critical_fields: list[str]) -> dict:
    """Compress a tool result while preserving critical fields verbatim."""
    preserved = {k: result[k] for k in critical_fields if k in result}
    # Summarize non-critical fields (narrative, descriptions, etc.).
    # summarize_text is whatever summarizer your stack provides (an LLM call
    # or an extractive summarizer); it is not defined here.
    summary = summarize_text(
        {k: v for k, v in result.items() if k not in critical_fields}
    )
    return {"preserved_facts": preserved, "summary": summary}
```
This function separates what must be exact from what can be compressed. The preserved facts stay in the case facts block at the top of the context. The summary goes in the middle where token savings matter most and precision matters least.
Context Rot: More Tokens Do Not Mean Better Performance
The most counterintuitive finding in recent agent research is that more context makes agents worse, not better.
The Chroma Research “context rot” study tested 18 state-of-the-art models and found that performance degrades continuously as input context grows, even on simple tasks. There is no safe threshold. Degradation begins immediately and worsens with every additional token. The study coined the term “context rot” to describe this measurable, reproducible decline.
This result challenges a common assumption in agent architecture: that bigger context windows are better, that 1M tokens is an improvement over 128K, that the solution to incomplete context is more context. The Chroma data says otherwise. The right data beats more data. Larger windows give you more room to fill with noise.
DeepMind documented an extreme example of context self-poisoning in their Gemini 2.5 technical report. Their Pokemon-playing agent hallucinated game state information that persisted in the context window, causing hours of wasted effort. The agent made decisions based on a game board that did not match reality. The context was self-poisoned, and the agent had no mechanism to detect or correct it.
The LangChain Context Engineering guide identifies four operations that address context rot: write (add information), select (retrieve what is relevant), compress (reduce what is redundant), and isolate (restrict what each component sees). These four operations are context Data Quality in practice. They enforce Sufficiency, Relevance, Economy, and Isolation from Vishnyakova’s five criteria.
The practical implication is clear. Every tool result you allow into the context window without validation is not just a potential source of bad reasoning. It is also actively degrading the quality of reasoning on everything else already in the window. Stale data does not sit quietly alongside good data. It poisons the entire inference.
Context quality is not a one-time check. In multi-turn agents, a result that was fresh on turn 3 may be stale by turn 15. Implement rolling freshness checks that re-validate cached tool results against their source TTLs. If a tool result’s TTL has expired since it entered the context, either re-fetch it or flag it as potentially outdated before the agent reasons over it again.
Without this, long-running agent sessions accumulate stale data the same way a warehouse accumulates schema drift: silently, until something breaks.
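A rolling freshness check is a short loop over provenance-wrapped results. This sketch assumes the envelope format from the provenance example earlier in the article (`retrieved_at` as an ISO timestamp, `ttl_seconds` as an integer); the function name is mine.

```python
from datetime import datetime, timezone

def revalidate_context(envelopes: list[dict]) -> list[dict]:
    """Flag provenance-wrapped results whose TTL has expired since retrieval.

    Expired results should be re-fetched or explicitly marked as potentially
    outdated before the agent reasons over them again.
    """
    now = datetime.now(timezone.utc)
    for env in envelopes:
        prov = env.get("provenance", {})
        retrieved = datetime.fromisoformat(prov["retrieved_at"].replace("Z", "+00:00"))
        age_seconds = (now - retrieved).total_seconds()
        env["expired"] = age_seconds > prov.get("ttl_seconds", 0)
    return envelopes
```

Run this between turns (or on a token-count trigger) rather than once at ingestion: the whole point is that freshness decays while the session runs.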
What Happens When a Tool Returns Stale Data
The most common context quality failure is also the quietest: a tool returns data that was accurate at some point but is no longer current. The agent has no way to know the difference. It reasons over the stale data with the same confidence it would apply to fresh data, and produces an answer that looks correct but is not.
Code example: stale tool data causes incorrect reasoning, then a freshness check catches it
```python
from datetime import datetime, timedelta

# A tool returns pricing data, but it's from 2019
def get_product_pricing(product_id: str) -> dict:
    # Simulates a tool call to a pricing API
    return {
        "product_id": product_id,
        "price": 49.99,
        "currency": "USD",
        "last_updated": "2019-03-15T00:00:00Z",  # years stale
    }

# WITHOUT freshness check: agent trusts the result
result = get_product_pricing("SKU-4821")
# Agent sees: "The current price is $49.99"
# Agent responds: "The product costs $49.99."
# Reality: the price changed to $89.99 in 2024.

# WITH freshness check: staleness is caught before reasoning
def validate_freshness(tool_result: dict, max_age_days: int = 90) -> dict:
    last_updated = datetime.fromisoformat(
        tool_result["last_updated"].replace("Z", "+00:00")
    )
    age = datetime.now(last_updated.tzinfo) - last_updated
    if age > timedelta(days=max_age_days):
        tool_result["_quality"] = {
            "stale": True,
            "age_days": age.days,
            "warning": f"Data is {age.days} days old. Do not treat as current.",
        }
    return tool_result

result = validate_freshness(get_product_pricing("SKU-4821"))
# result["_quality"]["stale"] is True
# result["_quality"]["warning"] tells the agent the data is years out of date
# Agent can now: refuse to answer, fetch a second source, or caveat its response
```
The freshness check is 10 lines of code. It does not require a new framework, a vector database, or a model upgrade. It requires treating tool results as untrusted input and validating them before they enter the context.
This pattern, validate before trust, is the same principle we apply to every other data pipeline in enterprise systems. ETL jobs have validation rules. Data warehouses have constraint checks. Streaming systems have schema registries. The context window has none of these by default. Adding them is an engineering decision, not a research problem.
Attach provenance metadata to every tool result before injecting it into context: source name, retrieval timestamp, and TTL. This metadata serves double duty: the agent can use it for self-evaluation, and your audit system can use it for compliance. When an agent makes a bad decision, provenance metadata lets you trace the cause to a specific tool result from a specific source at a specific time. Without it, debugging agent failures is guesswork.
The Practical Implication: Invest in Context, Not Models
OpenAI discovered this themselves when building their internal data analysis agent. They found that reliable outputs required layered context grounding, systematic evaluations, and runtime inspection before the agent produced trustworthy results. The model was GPT-4. The problem was not model capability. The problem was that raw tool results could not be trusted without multiple layers of validation.
The AgentDrift study confirmed this across seven LLMs: not one agent questioned tool-data reliability. The model did not matter. The context quality did. (I covered the full AgentDrift findings in the Data Quality article.)
OpenAI’s harness engineering experiment demonstrated the same principle in a different domain: Can Boluk improved 15 LLMs’ coding performance in a single afternoon by changing only the harness, not the model, with one model jumping from 6.7% to 68.3% success rate. The harness is context engineering applied to coding agents: custom linters, structural tests, and teaching error messages that control what enters the agent’s feedback loop.
This is Pike’s Rule 5 in action. The data structure (context) dominates. The algorithm (model) follows. If you have chosen the right context and organized it well, the model’s reasoning will almost always be self-evident. If you have not, no model upgrade will save you.
The teams that are building reliable agents are not the teams with the biggest models or the longest context windows. They are the teams treating the context window as a data pipeline and applying the same discipline to it that we spent two decades perfecting for warehouses, lakes, and streaming systems. Source reliability scoring. Freshness validation. Contradiction detection. Schema checks. Completeness monitoring. Provenance metadata.
These are not new inventions. They are Data Quality fundamentals applied to the right pipeline.
In the next article, I cover how evals measure whether context quality actually translates to output quality. Good context is necessary but not sufficient. You also need to measure whether the agent does the right thing with it.
Do Next
| Priority | Action | Why it matters |
|---|---|---|
| No experience | Ask your AI tool a question you know the answer to, then check whether it got it right. If it got it wrong, ask yourself: was the model bad, or was the data it used bad? | Most agent failures are context failures, not model failures. Recognizing the difference changes how you evaluate AI tools. |
| No experience | Read the Chroma context rot study summary. It takes five minutes. | Understanding that more context degrades performance is counterintuitive and essential. It changes how you think about “bigger is better” in AI. |
| Learning | Add a freshness check to one tool in your agent stack. Log how often tool results are older than 90 days. | This is the cheapest, highest-impact context quality control you can build. Most teams are shocked by how much stale data enters their agents unchecked. |
| Learning | Audit your agent’s context window for one production task. Print the full context and read it yourself. Is everything in there relevant? Is anything missing? Is anything stale? | You cannot improve what you cannot see. Most developers have never read the full context their agent reasons over. |
| Practitioner | Implement the five criteria (Relevance, Sufficiency, Isolation, Economy, Provenance) as a validation layer on tool results before they enter the context window. | This is the Context Quality Layer described in the architecture article. It is the single highest-leverage architectural investment for agent reliability. |
| Practitioner | Measure your cached-to-uncached token ratio and calculate the cost impact of context compression. | Economy is not optimization; it is viability. Manus found a 10x cost differential. Your ratio determines whether your agent is economically sustainable at scale. |
This is Part 6 of 12 in The Practitioner’s Guide to AI Agents. ← Previous: Build a Real Agent This Weekend · Next: Evals: How to Know If Your Agent Works →
Sources & References
- Rob Pike: Notes on Programming in C (1989)
- Andrej Karpathy on Context Engineering (2025)
- Chroma Research: Context Rot (2025)
- Vishnyakova: Context Engineering: From Prompts to Corporate Multi-Agent Architecture (2026)
- AgentDrift: Tool-Output Contamination in AI Agents (2026)
- LangChain: Context Engineering for Agents (2025)
- Manus: Context Engineering for AI Agents: Lessons from Building Manus (2025)
- Gemini 2.5 Technical Report, DeepMind (2025)
- OpenAI: Building an AI Data Agent (2025)
- Hacker News: Pike's Rules (March 2026)