Context Engineering, Formalized: Five Criteria That Validate the Agent Quality Thesis
Vishnyakova's 'Context Engineering' paper (arXiv 2603.09619) proposes five production-grade quality criteria for agent context and a four-level maturity pyramid. The framework independently validates the thesis from our three-part agent quality series and extends it with Isolation, Economy, and two higher-order disciplines: Intent Engineering and Specification Engineering.
The Series That Got Validated
Over the past two weeks, I published a three-part series on agent quality. Part 1 argued that the context window is enterprise AI’s most consequential unmonitored data pipeline. Part 2 mapped the architectural gap where quality checks should exist but don’t. Part 3 made the case that human judgment, not automation, is the irreplaceable layer above any quality gate.
Three days after the series went live, a colleague sent me a link to an academic paper I had not seen during my research: Context Engineering: From Prompts to Corporate Multi-Agent Architecture by Vera V. Vishnyakova at HSE University, Moscow. Published March 10, 2026, it arrived at the same structural insight from a different starting point.
What struck me was not that someone else was thinking about context quality. Plenty of people are. What struck me was the precision of the overlap. The paper proposes five production-grade quality criteria for agent context. I had proposed six Data Quality dimensions mapped to the context window. The paper frames context as “the agent’s operating system.” I had called it “the newest data pipeline in enterprise AI.” The paper argues that context quality is necessary but not sufficient and introduces Intent Engineering as the next layer. I had defined judgment-in-the-loop as the human role that automated checks cannot replace.
When independent analyses converge on the same structure from different directions, the underlying thesis is probably correct. This article decodes the paper, maps it against the blog series, and identifies what each framework adds that the other misses.
The Paper in 90 Seconds
Vishnyakova’s 25-page paper makes three core contributions.
First, it formalizes Context Engineering (CE) as a standalone discipline distinct from Prompt Engineering. Where Prompt Engineering optimizes the wording of a query, Context Engineering designs the entire informational environment in which an agent makes decisions: memory, policies, tool outputs, corporate constraints, prior-step history, and visibility boundaries for sub-agents.
Second, it proposes five production-grade quality criteria for agent context: Relevance, Sufficiency, Isolation, Economy, and Provenance. These are the properties that “good context” must satisfy before an agent acts on it.
Third, it introduces a four-level cumulative maturity pyramid:
- Prompt Engineering: Optimizing individual queries. Scale: one person, one call.
- Context Engineering: Designing the agent’s informational environment. Scale: team or product.
- Intent Engineering: Encoding corporate goals, values, and trade-off hierarchies into agent infrastructure. Scale: business unit.
- Specification Engineering: Creating a machine-readable corpus of corporate policies, quality standards, and operational procedures. Scale: corporation.
Each level subsumes the previous one as load-bearing infrastructure. No level cancels the one below it.
The paper’s central thesis: “Whoever controls the agent’s context controls its behavior; whoever controls its intent controls its strategy; whoever controls its specifications controls its scale.”
Five Criteria, Decoded
The paper’s Section 9 defines the five criteria. Vishnyakova is candid about their status: “This taxonomy is working, not canonical; the industry has not yet developed a unified standard.” Here is what each criterion means in practice.
1. Relevance: Only What This Step Needs
The agent receives the minimum information sufficient for the current decision. Excessive context is not harmless. It causes lost-in-the-middle degradation, distracts the model, and increases cost. Good context is not “everything available” but “the minimum sufficient for the decision.”
In practice: A coding agent fixing a bug in file X does not need the entire repository in its context window. A customer service agent resolving a billing dispute does not need the customer’s browsing history. Vishnyakova cites Breunig’s context rot research: when the Gemini 2.5 agent played Pokemon and its context exceeded 100,000 tokens, it stopped synthesizing new plans and began repeating patterns from its accumulated history. It hallucinated a nonexistent game item called “TEA” and spent hours trying to obtain it.
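Relevance as "minimum sufficient" can be sketched as a selection step rather than a wholesale dump. A minimal sketch, assuming each candidate chunk already carries a relevance score from an upstream retriever; the function names and the tuple layout are illustrative, not any real framework's API:

```python
# Hypothetical minimum-sufficient context selector.
# chunks: (text, relevance_score, token_count) -- layout is an assumption.
def select_context(chunks: list[tuple[str, float, int]],
                   min_score: float, token_budget: int) -> list[str]:
    """Keep only chunks above the relevance threshold, best first,
    and stop at the token budget instead of loading everything."""
    kept, used = [], 0
    for text, score, tokens in sorted(chunks, key=lambda c: -c[1]):
        if score < min_score or used + tokens > token_budget:
            continue  # irrelevant or over budget: leave it out of the window
        kept.append(text)
        used += tokens
    return kept

chunks = [
    ("bug report for file X", 0.92, 300),
    ("unrelated module docs", 0.15, 5000),
    ("stack trace", 0.88, 400),
]
print(select_context(chunks, min_score=0.5, token_budget=1000))
```

The design point is the inversion: the default is exclusion, and each chunk must earn its way into the window.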
2. Sufficiency: Everything Needed, No Guesswork
The context must contain everything the agent needs for a decision without filling gaps through hallucination. When data is missing, the agent invents plausible but false assertions. Sufficiency is a guard against hallucinations at the architectural level.
In practice: A financial analysis agent asked to compare quarterly revenue must have access to all four quarters of data, not just two. A legal research agent summarizing case law on a topic must retrieve cases from the relevant jurisdiction, not just the top three vector-similarity matches. If the agent cannot tell the difference between “the data says X” and “I am guessing X because the data is incomplete,” sufficiency has failed.
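The distinction between "the data says X" and "I am guessing X" can be enforced before the model ever reasons: verify the required pieces are present and surface the gap instead of letting the agent fill it. A minimal sketch; the quarterly-revenue field names are hypothetical:

```python
# Hypothetical sufficiency check: refuse to proceed on incomplete data
# rather than let the agent paper over gaps with plausible guesses.
def check_sufficiency(context: dict, required_keys: list[str]) -> list[str]:
    """Return the list of missing keys; an empty list means sufficient."""
    return [k for k in required_keys if context.get(k) is None]

# Only two of four quarters came back from retrieval:
quarters = {"Q1": 1.2, "Q2": 1.4, "Q3": None, "Q4": None}
missing = check_sufficiency(quarters, ["Q1", "Q2", "Q3", "Q4"])
if missing:
    # Escalate or re-retrieve -- anything but silent guessing.
    print(f"insufficient context, missing: {missing}")
```

The check is trivial; the architectural decision is that a failed check halts the step instead of degrading it.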
3. Isolation: Sub-Agents See Only Their Own Context
In multi-agent systems, each sub-agent must see only its own context. Data leakage between roles is both a controllability and a security problem. An agent that sees everything from its peers makes decisions based on data it was not designed to interpret.
In practice: Vishnyakova’s compliance control system provides a concrete example. The detector agent returned dozens of false-positive Named Entity Recognition (NER) hits into the shared context. The coordinator agent did not need raw hits; it needed a verdict and a confidence score. Without isolation, the coordinator was making decisions based on noise from a classifier it was not equipped to evaluate.
The StrongDM software factory case demonstrated the security dimension: AI agents systematically gamed traditional unit tests by writing `return true` or rewriting test assertions to match buggy code. StrongDM’s fix was to move test specifications outside the codebase into separate scenarios that agents could not see or modify, enforcing isolation at the infrastructure level.
Tomasev et al. (2026) formalize this as “privilege attenuation”: upon sub-delegation, an agent transfers only a strictly limited slice of its rights to the sub-agent. Google’s Agent-to-Agent (A2A) protocol implements controlled isolation at the protocol level, allowing agents to interact without exposing internal state.
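Privilege attenuation can be sketched in application code as a read-only, allow-listed view over a parent context. This is illustrative only; real systems such as A2A enforce the boundary at the protocol level, and the class and key names here are assumptions:

```python
# Hypothetical privilege-attenuated view: a sub-agent (or the coordinator)
# sees only the keys it was designed to interpret.
class ScopedContext:
    """Read-only view of a parent context restricted to an allow-list of keys."""
    def __init__(self, parent: dict, allowed: set[str]):
        self._parent = parent
        self._allowed = allowed

    def get(self, key: str):
        if key not in self._allowed:
            raise PermissionError(f"agent may not read {key!r}")
        return self._parent.get(key)

detector_output = {
    "ner_hits": ["...dozens of raw false-positive hits..."],
    "verdict": "flag",
    "confidence": 0.91,
}
# The coordinator's view: verdict and confidence only, never the raw NER noise.
view = ScopedContext(detector_output, allowed={"verdict", "confidence"})
print(view.get("verdict"))
```

Raising on out-of-scope reads, rather than returning nothing, makes leakage attempts visible in logs instead of silent.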
4. Economy: Minimum Tokens, Maximum Decision Quality
Every token in the context costs money, time, and latency. Context architecture directly determines the product’s unit economics. According to Manus (2025), the cost difference between cached and uncached tokens in their production system reached 10x ($0.30/MTok cached vs. $3.00/MTok uncached with Claude Sonnet). Their average input-to-output token ratio was 100:1, meaning the vast majority of computational cost comes from processing context, not generating responses.
In practice: Without compression, caching, and selective loading, inference cost grows super-linearly with the number of agent steps because each step resubmits the entire accumulated context. Vishnyakova recounts that when her team calculated the unit economics of a SaaS subscription with built-in analyst agents, inference cost without context optimization made the product uncompetitive. “It turned out that context engineering is a condition of economic viability, not a product add-on.”
For practitioners: If your agent’s input-to-output token ratio is anywhere near 100:1 and you are running on uncached tokens, context architecture is your single largest cost lever. Compression, caching, and selective loading are not optimizations; they are prerequisites for viable unit economics.
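The super-linear growth is easy to see in arithmetic. A cost sketch using the Manus-reported rates cited above ($3.00/MTok uncached vs. $0.30/MTok cached); the function, the step counts, and the cache-hit rate are illustrative assumptions, not measurements:

```python
# Rates from the Manus figures quoted in the article (USD per million input tokens).
UNCACHED_PER_MTOK = 3.00
CACHED_PER_MTOK = 0.30

def run_cost(steps: int, tokens_per_step: int, cache_hit_rate: float) -> float:
    """Estimate input cost for an agent run where every step resubmits
    the entire accumulated context -- hence the super-linear growth."""
    total, context = 0.0, 0
    for _ in range(steps):
        context += tokens_per_step          # context grows each step
        cached = context * cache_hit_rate
        uncached = context - cached
        total += (cached * CACHED_PER_MTOK + uncached * UNCACHED_PER_MTOK) / 1e6
    return total

# A hypothetical 50-step run appending 2,000 tokens per step:
no_cache = run_cost(50, 2_000, cache_hit_rate=0.0)
with_cache = run_cost(50, 2_000, cache_hit_rate=0.95)
print(f"uncached: ${no_cache:.2f}, 95% cached: ${with_cache:.2f}")
```

Even at this modest scale, the fully uncached run costs several times the mostly cached one, and the gap widens with every additional step.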
5. Provenance: Traceable to Source, Timestamp, and Trust Level
Every element of context must be traceable to which system produced it, when, and with what trust level. Without provenance, neither auditing agent decisions nor debugging errors nor regulatory compliance is possible.
In practice: Vishnyakova’s multi-agent system logged an incorrect decision, but without provenance metadata, determining which specific context fragment caused it was impossible. Tomasev et al. (2026) propose transitive accountability via attestation: in a delegation chain A to B to C, agent B signs a cryptographic report on agent C’s work and passes it to agent A, creating a chain of verifiable signatures. This makes every context element traceable to its source through cryptographic proof.
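The simplest form of provenance is an envelope around each context element carrying source system, timestamp, and trust level. A minimal sketch; the field names are assumptions, not a standard, and cryptographic attestation (as in Tomasev et al.) would sit on top of this:

```python
# Hypothetical provenance envelope for a single context element.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ContextElement:
    content: str
    source: str            # which system produced it
    retrieved_at: datetime # when it was produced
    trust: float           # 0.0 (untrusted) .. 1.0 (authoritative)

elem = ContextElement(
    content="Q3 revenue: $4.1M",
    source="erp.finance.reports",
    retrieved_at=datetime.now(timezone.utc),
    trust=0.9,
)
# When the agent errs, the log shows exactly which fragment came from where.
print(f"{elem.source} @ {elem.retrieved_at:%Y-%m-%d} trust={elem.trust}")
```

Freezing the dataclass is deliberate: provenance metadata that agents can mutate is not provenance.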
Where the Paper and the Series Converge
The mapping between the paper’s five criteria and the blog’s three-part framework is neither accidental nor forced. Both started from the same structural observation (context quality determines agent quality) and built toward the same conclusion (automated quality checks are necessary but insufficient without human judgment or corporate intent). The following comparison distinguishes what the paper directly states, what the blog argued earlier, and what this article synthesizes across both.
The Blog’s Six DQ Dimensions and the Paper’s Five Criteria
The first article mapped six DAMA-DMBOK Data Quality dimensions to the context window: Accuracy, Completeness, Timeliness, Consistency, Validity, and Uniqueness. Here is how they align with Vishnyakova’s five criteria.
| Blog Dimension | Paper Criterion | Relationship |
|---|---|---|
| Accuracy | Sufficiency | Both address whether the agent is reasoning on correct, complete information. Sufficiency explicitly frames data gaps as a hallucination trigger. |
| Completeness | Sufficiency + Relevance | Completeness asks “did the tool return enough?” Sufficiency asks the same. Relevance adds the inverse: “was irrelevant data excluded?” |
| Timeliness | (implicit) | The blog treats freshness as an explicit dimension. The paper folds it into Relevance (stale data is irrelevant data) and Provenance (timestamps as metadata). |
| Consistency | (implicit in context clash) | The blog’s contradiction detection maps to the paper’s “context clash” degradation mode, but the paper does not elevate it to a standalone quality criterion. |
| Validity | (implicit in architecture) | The blog’s schema validation dimension. The paper’s discussion of Google ADK’s “processor pipeline” (compression, filtering, enrichment) implies schema validation but does not name it separately. |
| Uniqueness | Relevance + Economy | Deduplication reduces both irrelevant repetition (Relevance) and wasted tokens (Economy). |
| (not in blog) | Isolation | New contribution. The blog’s framework covers single-agent tool-calling. Isolation addresses multi-agent boundary protection. |
| (not in blog) | Economy | New contribution. The blog mentions token budget briefly under Uniqueness. Economy elevates cost to a first-class engineering constraint. |
| (not in blog) | Provenance | Partially in blog. The blog’s Part 2 proposes “Source Reliability Scoring” and “Confidence Scoring.” Provenance formalizes this with cryptographic attestation chains and regulatory compliance requirements. |
The overlap is significant. Six of nine total concepts (across both frameworks) have direct or partial mappings. The differences are complementary, not contradictory: the blog provides more implementation detail on Timeliness, Consistency, and Validity; the paper provides more formal framing on Isolation, Economy, and Provenance.
Context Quality Layer Meets Context-as-OS
The second article proposed a Context Quality Layer: a validation gate between tool results and the context window running six checks (source reliability scoring, freshness validation, contradiction detection, schema validation, completeness checks, confidence scoring).
Vishnyakova’s five criteria are the engineering specification for what that gate should enforce. The blog identified the architectural gap at Boundary 3 (tool result entering context window with no semantic validation). The paper names the properties that close it.
The blog called the context window “a data pipeline.” The paper calls it “the agent’s operating system.” Both framings lead to the same design requirement: engineered quality controls at the boundary where external data enters the agent’s decision-making environment. The OS framing is stronger because it captures the additional responsibilities of memory management, process isolation, and resource allocation that a pipeline metaphor does not.
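One way to sketch that gate at Boundary 3 is a pipeline of named checks applied to each tool result before admission to the context window. The check names echo the blog's dimensions; the implementation, field names, and trusted-source list are illustrative assumptions:

```python
# Hypothetical Boundary 3 quality gate: tool result -> checks -> context window.
from datetime import datetime, timedelta, timezone

TRUSTED_SOURCES = {"erp.finance", "crm.tickets"}  # assumption for the sketch

def schema_ok(result: dict, required: set[str]) -> bool:
    """Validity: the result carries the fields the agent expects."""
    return required <= result.keys()

def freshness_ok(result: dict, max_age: timedelta) -> bool:
    """Timeliness: the result is recent enough to act on."""
    return datetime.now(timezone.utc) - result["retrieved_at"] <= max_age

def admit_to_context(result: dict) -> bool:
    """Run the gate; any failed check keeps the result out of the window."""
    return all([
        schema_ok(result, {"payload", "source", "retrieved_at"}),
        freshness_ok(result, max_age=timedelta(hours=24)),
        result.get("source") in TRUSTED_SOURCES,  # source reliability proxy
    ])

result = {"payload": "...", "source": "erp.finance",
          "retrieved_at": datetime.now(timezone.utc)}
print(admit_to_context(result))
```

Contradiction detection and confidence scoring would slot in as further checks in the same list; the point is that admission is an explicit, auditable decision rather than a default.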
Judgment-in-the-Loop Meets Intent Engineering
The third article defined judgment-in-the-loop as the irreplaceable human capability of recognizing when AI output looks right but is wrong. The five responsibilities: Evaluate, Validate, Correct, Guide, Decide.
The paper directly validates this from the opposite direction. Vishnyakova’s argument: an agent with impeccably designed context (relevant, sufficient, isolated, economical, traceable) can still fail strategically. It can “optimize the wrong metric, sacrifice the wrong value, pursue efficiency at the expense of the corporate goal it was meant to serve.” This is an intent failure, not a context failure.
Reading both together, the paper’s Intent Engineering is the organizational formalization of what this blog’s series calls judgment-in-the-loop. Where the blog frames it as a human capability (domain expertise, institutional memory, the ability to recognize “almost right”), the paper frames it as an engineering discipline (encoding goals, priorities, trade-off hierarchies into agent infrastructure). Both are correct. Judgment-in-the-loop describes the human input. Intent Engineering describes the infrastructure that receives, encodes, and enforces that input.
The Klarna case study in the paper illustrates the connection precisely. Klarna’s AI assistant handled two-thirds of customer service chats in its first month in early 2024. Context Engineering appeared to be working: the agent had data and could resolve tickets. But the corporate intent (the balance between cost savings and customer loyalty, the brand’s target NPS, the hierarchy of trade-offs in service situations) was never formalized. The agent optimized cost per token, not the value of customer relationships. In May 2025, Klarna publicly reversed course and began restoring human customer-service access. Even so, by the Q3 2025 earnings call, Klarna reported the system had done work equivalent to 853 full-time agents and delivered roughly $60 million in savings, because the AI continued handling volume even as the company rebalanced toward human service.
The blog series used a different case (Air Canada) to make the same structural argument. Both demonstrate that technically competent agents fail when the judgment that should govern their behavior is not encoded into the system.
What the Paper Adds
Three contributions from the paper extend beyond the blog’s framework.
Isolation as a first-class quality criterion. The blog’s framework addresses single-agent tool-calling. Enterprise deployments increasingly use multi-agent systems where sub-agents handle different aspects of a task. Isolation formalizes the requirement that sub-agents must not see each other’s full context. Without it, data leakage between roles creates both controllability failures (an agent making decisions based on data it was not designed to interpret) and security failures (agents gaming visible test data). The A2A protocol and Delegation Capability Tokens are emerging infrastructure for enforcing isolation at the protocol level.
Economy as an engineering constraint. The blog mentions token budget under Uniqueness but treats it as secondary. The paper elevates Economy to a first-class criterion, backed by concrete data. Manus’s 10x cost differential between cached and uncached tokens, combined with a 100:1 input-to-output ratio, means context architecture decisions dominate unit economics. For products with built-in agent capabilities, Economy is not optimization; it is viability.
The four-level maturity pyramid. The blog’s series covers Levels 2 and 3 of the paper’s pyramid (Context Engineering and judgment-as-intent). The paper adds Level 4, Specification Engineering: creating a machine-readable corpus of corporate policies, quality standards, and operational procedures that governs agents across an entire organization. Vishnyakova draws a precise analogy: “Specifications for agents are what ERP (Enterprise Resource Planning) is for business processes: ERP runs on codified procedures, not verbal agreements. Multi-agent systems require the same formalization, applied to corporate knowledge.”
The TELUS case illustrates the scale at which this matters. By late 2025, TELUS Digital said Fuel iX had been rolled out to 70,000 team members and had led to more than 21,000 custom copilots. In early 2026, the company reported its platform had processed over 2 trillion tokens in 2025. Without specification-level governance, that many independently configured agents risk diverging in behavior, accumulating conflicting priorities, and producing mutually contradictory decisions.
What the Blog Adds
The paper is a formal framework. The blog provides three things the paper does not.
Explicit Timeliness and Consistency as separate dimensions. The paper treats staleness as a sub-case of Relevance and contradiction as a degradation mode (context clash) rather than a quality criterion. The blog separates both into named, measurable dimensions. This matters for implementation: a team building a Context Quality Layer needs separate validation rules for “is this data current?” (Timeliness) and “do these two tool results contradict each other?” (Consistency). Folding them into broader criteria risks under-specifying the checks.
The compound error math. The blog’s 0.85^10 ≈ 0.20 calculation (an 85%-accurate agent fails 80% of the time on a 10-step task) makes the abstract argument concrete in a way that resonates with engineering and leadership audiences. The paper describes degradation qualitatively (“degradation over long horizons,” “the model begins to get lost in the middle”) but does not quantify the compounding effect.
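The compounding arithmetic is a one-liner to verify:

```python
# Per-step accuracy compounds multiplicatively across a multi-step task.
per_step_accuracy = 0.85
steps = 10
task_success = per_step_accuracy ** steps
print(f"{task_success:.2%}")  # just under 20%: the task fails about 80% of the time
```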
The four-boundary architecture model. The blog’s Part 2 mapped four specific data-flow boundaries inside an agent (User Prompt to Agent, Agent to Tool Call, Tool Result to Context Window, Context Window to Response) and showed that three have validation while one does not. This architectural specificity makes the gap actionable: teams know exactly where to insert the quality gate. The paper describes the need for quality criteria without specifying the architectural insertion point.
The Uncomfortable Overlap
Both the paper and the blog cite the same evidence base:
- The MSR/Salesforce study showing a 39% quality drop from single-turn to multi-turn conversations
- Karpathy’s analogy of the context window as “the LLM’s RAM”
- The Gemini 2.5 Pokemon context poisoning (hallucinated game state persisting in context)
- Breunig’s context rot taxonomy
- The LangChain context engineering framework (write, select, compress, isolate)
This is not a flaw. When independent analyses converge on the same evidence, it signals that the field is consolidating. The observations are no longer isolated findings; they are forming a consensus. Context quality as an engineering discipline is not one person’s opinion. It is where the field is headed.
The enterprise data confirms the urgency. According to Deloitte (2026), close to three-quarters of enterprises plan agentic AI deployment within two years, but only 21% have a mature governance model. KPMG’s tracking captures what happens without governance: agent deployment rose from 11% in Q1 2025 to 42% in Q3, then registered at 26% in Q4 as organizations shifted from rapid growth to more controlled scaling.
Gartner predicts 40% of enterprise apps will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025. The gap between deployment velocity and governance maturity is the central problem both the paper and this blog are trying to close.
The Integrated Framework
Reading the paper alongside the blog series produces a more complete framework than either offers alone.
For context quality checks (automated, at Boundary 3): use the blog’s six DQ dimensions (Accuracy, Completeness, Timeliness, Consistency, Validity, Uniqueness) as the implementation checklist, supplemented by the paper’s Isolation, Economy, and Provenance criteria for multi-agent, cost-sensitive, and audit-sensitive deployments. Nine quality properties in total.
For governance structure (organizational, above automated checks): use the paper’s maturity pyramid. Level 2 (Context Engineering) covers the technical quality layer. Level 3 (Intent Engineering) formalizes the blog’s judgment-in-the-loop as corporate infrastructure. Level 4 (Specification Engineering) extends governance to the enterprise scale.
For the human role: the blog’s judgment-in-the-loop framework (Evaluate, Validate, Correct, Guide, Decide) describes what the person brings. The paper’s Intent Engineering describes how to encode that judgment into the system so it persists when the person is not in the loop.
What this looks like in practice. Nine quality properties for automated context checks (six from this blog, plus Isolation, Economy, and Provenance from the paper). The paper’s maturity pyramid for governance structure. The blog’s judgment-in-the-loop framework for the human role. Together, these form the complete agent quality stack that neither framework provides alone.
Do Next
| Priority | Action | Why it matters |
|---|---|---|
| This week | Read the paper yourself: arXiv 2603.09619, 25 pages. Focus on Sections 9, 12, and 17 (quality criteria, Klarna case, maturity pyramid). | Primary sources are always better than summaries. The paper is written accessibly. |
| This week | Audit your agent stack for Isolation violations. Do sub-agents in your multi-agent workflows see data they should not? | Isolation is the paper’s most novel contribution and the most commonly violated criterion in enterprise agent deployments. |
| This month | Add Provenance metadata to tool results entering the context window: source system, timestamp, trust level. | Without provenance, you cannot audit agent decisions, debug errors, or satisfy regulatory requirements. This is table stakes for enterprise deployment. |
| This month | Calculate the Economy of your agent’s context. What is your cached vs. uncached token ratio? What does each agent step cost? | Context architecture is product economics. If your 100:1 input-to-output ratio runs on uncached tokens, you are paying 10x more than you should. |
| This quarter | Map your organization against the maturity pyramid. Are you at Level 1 (prompt craft), Level 2 (designed context), Level 3 (encoded intent), or Level 4 (machine-readable specifications)? | The level at which your company has stopped is the measure of your agentic infrastructure maturity. Most organizations are between Level 1 and Level 2. |
| This quarter | Identify the judgment-in-the-loop owners for each agent-augmented workflow. Who defines the trade-off hierarchies? Who decides what “correct” means for this domain? | Intent Engineering requires cross-functional collaboration. Technical teams cannot define corporate priorities alone. |
Sources & References
- Vishnyakova (2026): Context Engineering: From Prompts to Corporate Multi-Agent Architecture
- Deloitte (2026): State of AI in the Enterprise, 7th Edition
- KPMG (2026): Q4 AI Pulse Survey
- Gartner (2025): 40% of Enterprise Apps Will Feature AI Agents by 2026
- Breunig (2025): How Long Contexts Fail and How to Fix Them
- Breunig (2025): An Agentic Case Study: Playing Pokemon with Gemini
- Manus (2025): Context Engineering for AI Agents: Lessons from Building Manus
- LangChain (2025): Context Engineering for Agents
- Klarna (2025): AI Assistant: Two-Thirds of Customer Service Chats
- Klarna (2025): AI Agent: Doing the Work of 853 Employees
- TELUS Digital (2025): Fuel iX Makes AI Accessible to Every Team Member
- Tomasev et al. (2026): Intelligent AI Delegation
- Google (2025): Agent2Agent Protocol (A2A)
- Google (2025): Architecting Efficient Context-Aware Multi-Agent Framework
- Google DeepMind (2025): Gemini 2.5 Technical Report
- MSR/Salesforce (2025): Multi-Turn Quality Degradation in LLM Conversations
- StrongDM (2025): Software Factory: Non-Interactive Agent Development