Prompt Engineering for Production Agents
Production agents need prompts that produce consistent, structured output on ambiguous, real-world inputs. This article covers the five patterns that separate production prompt engineering from tutorial-grade prompting: explicit criteria, few-shot examples, nullable fields, enum-with-fallback, and fixing descriptions before adding examples.
Companion article: The Practitioner’s Guide to AI Agents
The Specification Problem
I watched a team spend two weeks debugging an extraction agent that “randomly” miscategorized support tickets. The model was fine. The tools worked. The context was clean. The problem was in the system prompt: “Categorize the ticket as billing, technical, or other.”
“Other” was doing all the wrong work. Feature requests went to “other.” Account access issues went to “other.” Complaints about missing features went to “technical” half the time and “other” the rest. The model had no specification for what distinguished the categories, so it applied its own judgment, and that judgment was inconsistent across runs.
The fix took fifteen minutes: replace “categorize the ticket” with explicit criteria defining each category, add two examples showing ambiguous cases, and specify what “other” actually meant. Pass rate went from 72% to 96% overnight.
This is the pattern behind most production prompt failures. The model is not wrong. The specification is incomplete. Tutorial prompts work because the task is simple. Production prompts fail because the task has edges, ambiguities, and constraints that the prompt does not encode.
This article covers the five patterns that close the gap between tutorial prompting and production prompt engineering for agents.
Pattern 1: Explicit Criteria, Not Vague Instructions
The most common prompt failure in production agents is vague evaluation criteria. “Check if the output is accurate” gives the model no standard to apply. “Flag inaccurate comments” sounds specific but is not: what counts as inaccurate? How wrong does something need to be before it gets flagged?
Vague criteria produce two problems. The first is inconsistency: the same input gets different judgments across runs because the model applies a different internal threshold each time. The second is that the behavior cannot be calibrated: you cannot tune a threshold you did not define.
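The inconsistency can be measured directly. Here is a minimal sketch, assuming you have already collected the model's outputs from repeated runs on the same input. The `agreement_rate` helper and the sample labels are illustrative and do not come from any library.

```python
from collections import Counter

def agreement_rate(outputs: list[str]) -> float:
    """Share of runs that match the most common output for one input."""
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)

# Hypothetical labels from five runs of the same ticket through each prompt.
vague_runs = ["technical", "other", "technical", "other", "other"]
explicit_runs = ["feature_request"] * 5

print(agreement_rate(vague_runs))     # 0.6
print(agreement_rate(explicit_runs))  # 1.0
```

Tracking this number per prompt version gives you something concrete to compare when you tighten the criteria.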
The fix is categorical criteria. Replace subjective judgment with testable conditions.
# Vague (inconsistent across runs)
Flag any inaccurate code comments.

# Explicit (testable, consistent)
Flag a comment ONLY when one of these conditions is true:

1. CONTRADICTS_CODE: The comment describes behavior that directly
   contradicts what the code does.
   Example: Comment says "returns null on failure" but the function
   raises ValueError.

2. STALE_REFERENCE: The comment references a variable, function, or
   class that no longer exists in the current scope.
   Example: Comment says "calls validate()" but validate() was removed.

3. MISLEADING_TYPE: The comment states a type that conflicts with
   the actual type annotation or runtime type.
   Example: Comment says "accepts a list" but the parameter is typed
   as dict.

Do NOT flag: comments that are vague but not wrong, comments that
describe intent rather than implementation, or TODO/FIXME markers.
The explicit version does four things the vague version does not. It defines categories (what kinds of inaccuracy matter). It provides examples anchoring each category (what a real violation looks like). It specifies exclusions (what should not be flagged). And it uses “ONLY when” framing to set a conservative default: when in doubt, do not flag.
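One way to keep the criteria testable downstream is to mirror the categories in code and reject any reason the prompt never defined. This is a sketch under assumptions: the `FlagReason` enum and `parse_flag` helper are hypothetical names, and the model is assumed to return either a reason code or nothing.

```python
from enum import Enum
from typing import Optional

class FlagReason(Enum):
    CONTRADICTS_CODE = "CONTRADICTS_CODE"
    STALE_REFERENCE = "STALE_REFERENCE"
    MISLEADING_TYPE = "MISLEADING_TYPE"

def parse_flag(raw_reason: Optional[str]) -> Optional[FlagReason]:
    """Return the matching reason, or None when the model did not flag.

    Raises ValueError when the model invents a reason outside the criteria.
    """
    if raw_reason is None:
        return None
    try:
        return FlagReason(raw_reason)
    except ValueError:
        raise ValueError(f"reason outside defined criteria: {raw_reason!r}") from None

print(parse_flag("STALE_REFERENCE"))  # FlagReason.STALE_REFERENCE
print(parse_flag(None))               # None
```

A reason that fails to parse is a signal that the model drifted from the specification, which is worth logging rather than silently accepting.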
This matters for agent reliability because agents make hundreds of decisions per session, and small per-decision error rates compound. If each decision independently has a 5% chance of being inconsistent, the chance that at least one of 50 decisions goes wrong is 1 - 0.95^50, or about 92%. Explicit criteria reduce the per-decision error rate, and the compounding math rewards even small improvements.
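The compounding arithmetic is easy to check. The sketch below assumes decisions fail independently, which is a simplification, and the function name is illustrative.

```python
def p_at_least_one_error(per_decision_error: float, decisions: int) -> float:
    """Probability that at least one of `decisions` independent decisions is wrong."""
    return 1 - (1 - per_decision_error) ** decisions

# Compare a few per-decision error rates over a 50-decision session.
for rate in (0.05, 0.02, 0.01):
    chance = p_at_least_one_error(rate, 50)
    print(f"{rate:.0%} per decision -> {chance:.0%} chance of at least one error")
```

Under this assumption the session-level figures come out to roughly 92%, 64%, and 39%, which is why shaving even a few points off the per-decision rate pays off.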
The False Positive Problem
Explicit criteria also control false positives. This matters more than most teams realize.
Below a 5% false positive rate, users trust the system. Flags get investigated. The validation adds value. Above 15%, users start ignoring flags. “It always complains about something” becomes the reflex, and real issues get lost in the noise.
If your agent’s validation or classification produces more than 15% false positives, the problem is your criteria, not the model. Tighten the categories. Add exclusions. Accept that catching 80% of real issues with low noise is better than catching 95% of real issues while flooding the output with false alarms.
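To see which side of those thresholds you are on, you need a hand-labeled sample. The sketch below treats the "false positive rate" as the share of flags that turn out to be false alarms, which is an assumption about how the figure is measured. The `flag_quality` helper and the sample data are hypothetical.

```python
def flag_quality(reviews: list[tuple[bool, bool]]) -> dict[str, float]:
    """Summarize a labeled sample of (was_flagged, is_real_issue) pairs."""
    flagged = [real for was_flagged, real in reviews if was_flagged]
    real_issues = [was_flagged for was_flagged, real in reviews if real]
    # Share of raised flags that were not real issues.
    false_alarm_share = flagged.count(False) / len(flagged) if flagged else 0.0
    # Share of real issues that were actually flagged.
    recall = real_issues.count(True) / len(real_issues) if real_issues else 0.0
    return {"false_alarm_share": false_alarm_share, "recall": recall}

# Hypothetical review sample: 8 correct flags, 2 false alarms,
# 2 missed issues, 8 correctly ignored items.
sample = (
    [(True, True)] * 8
    + [(True, False)] * 2
    + [(False, True)] * 2
    + [(False, False)] * 8
)
print(flag_quality(sample))  # {'false_alarm_share': 0.2, 'recall': 0.8}
```

In this sample the false alarm share is above the 15% mark, so the next step would be tightening categories or adding exclusions rather than changing the model.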
Pattern 2: Few-Shot Examples for Ambiguous Cases
Explicit criteria handle the clear cases. Few-shot examples handle the edges: inputs where reasonable people (or reasonable models) might disagree about the right answer.
When to Use Few-Shot
Few-shot examples are most effective in four situations:
- Format demonstration: Showing the exact output structure you want (JSON shape, field names, nesting).
- Ambiguous case handling: Demonstrating edge cases where the category is not obvious from the criteria alone.
- Extraction boundaries: Showing what to extract AND what to leave out.
- Cross-category disambiguation: Showing that a feature request disguised as a complaint is still a feature request.
How Many Examples
| Count | When to Use |
|---|---|
| 0 (zero-shot) | Task is unambiguous, format is simple |
| 1 | Format demonstration only, no edge cases |
| 2-3 | Standard. One normal case plus one or two edge cases |
| 4-5 | Complex taxonomy with overlapping categories |
| 6+ | Your criteria need reworking, not more examples |
The signal to watch: if you keep adding examples because the model still gets edge cases wrong, stop. Six or more examples almost always means the underlying criteria are unclear. Fix the descriptions first, then see if 2-3 examples are sufficient.
What Good Examples Show
Each example in a few-shot set should teach the model something it cannot learn from the criteria alone.
## Example 1: Standard billing inquiry
Input: "I was charged twice for my March subscription"
Output: {"category": "billing", "urgency": "medium",
         "reason": "Duplicate charge on subscription renewal"}

## Example 2: Feature request disguised as complaint
Input: "Your app is terrible because it doesn't support dark mode"
Output: {"category": "feature_request", "urgency": "low",
         "reason": "Requesting dark mode; complaint is about missing feature,
         not a defect"}
# NOTE: This is a feature_request, not "other" or "technical"

## Example 3: Access issue with technical symptoms
Input: "I get a 500 error when I try to reset my password"
Output: {"category": "account_access", "urgency": "high",
         "reason": "Password reset flow broken; user goal is account access,
         not reporting a technical bug"}
# NOTE: Categorize by the user's goal, not the technical symptom
Example 1 establishes the format. Example 2 disambiguates: a complaint about a missing feature is a feature request, not a complaint. Example 3 establishes the rule: categorize by the user’s goal (account access), not the surface symptom (500 error). Without these examples, the model applies its own disambiguation rules, which vary across runs.
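Keeping the examples as data rather than hand-typed prompt text makes it harder for the format to drift between examples. This is a minimal sketch: the `EXAMPLES` list and `render_examples` helper are hypothetical names, and the heading layout simply follows the examples above.

```python
import json

# Two of the examples from above, stored as data.
EXAMPLES = [
    {
        "title": "Standard billing inquiry",
        "input": "I was charged twice for my March subscription",
        "output": {"category": "billing", "urgency": "medium",
                   "reason": "Duplicate charge on subscription renewal"},
    },
    {
        "title": "Feature request disguised as complaint",
        "input": "Your app is terrible because it doesn't support dark mode",
        "output": {"category": "feature_request", "urgency": "low",
                   "reason": "Requesting dark mode; complaint is about "
                             "missing feature, not a defect"},
    },
]

def render_examples(examples: list[dict]) -> str:
    """Render examples into the few-shot section of a prompt."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        blocks.append(
            f"## Example {i}: {ex['title']}\n"
            f"Input: {json.dumps(ex['input'])}\n"
            f"Output: {json.dumps(ex['output'])}"
        )
    return "\n\n".join(blocks)

print(render_examples(EXAMPLES))
```

Because the output is serialized with `json.dumps`, every example is guaranteed to show the model valid JSON with the same field order.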
The Extraction Rule
For data extraction tasks, always show null for missing fields in your examples. Without this, models tend to fabricate plausible values to “be helpful.”
# Example extraction from a company profile
{
    "company_name": "Acme Corp",
    "founded_year": 2015,
    "employee_count": None,     # Not mentioned in source. Do not guess.
    "headquarters_city": None   # Not mentioned in source. Do not guess.
}
The None (rendered as null in the JSON the model sees) with an explicit comment (“Not mentioned in source. Do not guess.”) teaches the model that omission is acceptable. Models have a strong tendency to fill required-looking fields with plausible data. Showing null as a valid, sanctioned output reduces this tendency significantly.
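If the example lives in Python code, the standard `json` module handles the translation: `None` serializes to `null`. A short sketch follows, with the variable names chosen for illustration.

```python
import json

example_output = {
    "company_name": "Acme Corp",
    "founded_year": 2015,
    "employee_count": None,      # not mentioned in source
    "headquarters_city": None,   # not mentioned in source
}

# json.dumps turns Python None into JSON null, which is what the prompt should show.
rendered = json.dumps(example_output, indent=2)
print(rendered)
```

The rendered text is what belongs in the prompt. Python comments do not survive serialization, so the "do not guess" instruction has to be stated in the surrounding prompt text.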
Pattern 3: Nullable Fields for Required Output
Schemas for agent tool calls define required fields. When a field is required, the model must include it in every response. But what if the source data does not contain the value?
Without a nullable option, the model faces a forced choice: skip a required field (schema violation) or fabricate a value (hallucination). Both are wrong. The nullable pattern provides the correct escape hatch.
# Problem: required field forces fabrication
"employee_count": {
    "type": "integer",
    "description": "Number of employees"
}
# If source doesn't mention employees, the model guesses a plausible number such as 50

# Solution: required + nullable
"employee_count": {
    "type": ["integer", "null"],
    "description": "Number of employees. Null if not stated in the source."
}
# Model returns null instead of guessing
The field is still required (it must appear in the output), but its value can be null. This is not the same as making the field optional. An optional field might be omitted silently, and you would not know whether the model skipped it intentionally or forgot it. A required nullable field is always present: its value tells you whether the data was found (integer) or not (null).
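The difference between "absent" and "null" is visible when you inspect the model's output. The sketch below uses only the standard library. The `check_field` helper and its return labels are hypothetical, and it covers just the integer-or-null case from the schema above.

```python
from typing import Any

def check_field(output: dict[str, Any], name: str) -> str:
    """Classify a required, nullable integer field in a model's output."""
    if name not in output:
        return "missing"      # schema violation: required key absent
    value = output[name]
    if value is None:
        return "not_found"    # model reported that the source had no value
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return "found"
    return "wrong_type"

print(check_field({"employee_count": 120}, "employee_count"))   # found
print(check_field({"employee_count": None}, "employee_count"))  # not_found
print(check_field({}, "employee_count"))                        # missing
```

Only the "missing" and "wrong_type" outcomes indicate a problem with the model's output. "not_found" is a legitimate answer.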
Where Nullable Fields Matter Most
Three scenarios demand nullable fields:
- Document extraction: Invoices, contracts, and forms vary in completeness. A vendor name is always present; an account number sometimes is not.
- Entity resolution: Matching records across systems where one system has fields the other lacks.
- Multi-source aggregation: An agent pulls data from three APIs. Each API returns a different subset of fields.
In all three cases, the alternative to nullable fields is hallucination. The model fills gaps with plausible data that looks correct in spot checks but fails in production when downstream systems act on fabricated values.
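The multi-source case can be sketched as a merge that fills gaps with an explicit `None` rather than a guess. The field names reuse the company-profile example. `FIELDS` and `merge_sources` are hypothetical, and "first non-null value wins" is just one possible precedence rule.

```python
from typing import Any, Optional

FIELDS = ("company_name", "founded_year", "employee_count")

def merge_sources(*records: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Merge partial records; every field is present, None when no source has it."""
    return {
        name: next((r[name] for r in records if r.get(name) is not None), None)
        for name in FIELDS
    }

# Hypothetical responses from three APIs, each returning a different subset.
merged = merge_sources(
    {"company_name": "Acme Corp"},
    {"founded_year": 2015, "employee_count": None},
    {},
)
print(merged)
# {'company_name': 'Acme Corp', 'founded_year': 2015, 'employee_count': None}
```

Downstream code can then branch on `None` instead of acting on a fabricated value.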
Pattern 4: Enum with Fallback
Enums constrain tool output to a fixed set of values. They are excellent for 90% of cases and terrible for the remaining 10%. When a real-world input does not fit any enum value, the model picks the closest wrong one. A payment made via PayPal gets categorized as “credit_card.” A refund for “company policy” gets mapped to “changed_mind.”
The fix: add “other” to the enum and pair it with a detail field.
{
    "payment_method": {
        "type": "string",
        "enum": ["credit_card", "bank_transfer", "check", "cash", "other"],
        "description": "Payment method used for this transaction."
    },
    "payment_method_detail": {
        "type": ["string", "null"],
        "description": (
            "Required when payment_method is 'other'. Describes the "
            "actual payment method in free text. Null for standard "
            "payment methods."
        )
    }
}
This gives you the best of both worlds. The 90% of standard cases get clean, filterable enum values. The 10% of edge cases get a fallback with a free-text explanation. Downstream systems can process the standard values automatically and route “other” cases to human review or secondary classification.
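The schema alone cannot enforce the "detail is required when the value is other" rule, so a downstream check is a reasonable place for it. In this sketch the routing labels and the `route_payment` helper are hypothetical.

```python
STANDARD_METHODS = {"credit_card", "bank_transfer", "check", "cash"}

def route_payment(record: dict) -> str:
    """Decide how to handle one tool output based on the enum and detail field."""
    method = record.get("payment_method")
    detail = record.get("payment_method_detail")
    if method in STANDARD_METHODS:
        return "auto"      # clean enum value: process automatically
    if method == "other" and detail:
        return "review"    # fallback with explanation: send to human review
    return "reject"        # unknown value, or 'other' without a detail

print(route_payment({"payment_method": "cash", "payment_method_detail": None}))
print(route_payment({"payment_method": "other", "payment_method_detail": "PayPal"}))
print(route_payment({"payment_method": "other", "payment_method_detail": None}))
```

The three calls print `auto`, `review`, and `reject` respectively. The last case is the one to watch, since an "other" with no detail gives you nothing to learn from.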
When to Use This Pattern
Use enum-with-fallback when:
- The domain has a long tail of rare values (payment methods, refund reasons, error types)
- New values appear over time (a new payment provider, a new product category)
- Forcing wrong categorization would cause downstream errors
Do not use it when:
- The set of values is genuinely fixed and exhaustive (days of the week, boolean states)
- “Other” would become the most common value (your enum is missing major categories; expand it instead)
Monitor the “other” rate in production. If it exceeds 20%, your enum is stale. Mine the detail field text for new categories to add to the enum. This is how production schemas evolve: the fallback field generates the data for expanding the primary field.
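Both halves of that loop, the rate check and the mining step, fit in a few lines. This sketch assumes records shaped like the payment schema above. The helper names and sample data are hypothetical, and the normalization is deliberately naive (lowercase and strip).

```python
from collections import Counter

def other_rate(records: list[dict]) -> float:
    """Share of records whose payment_method fell through to 'other'."""
    if not records:
        return 0.0
    others = sum(1 for r in records if r["payment_method"] == "other")
    return others / len(records)

def top_other_details(records: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Most frequent free-text details among 'other' records."""
    details = Counter(
        r["payment_method_detail"].strip().lower()
        for r in records
        if r["payment_method"] == "other" and r.get("payment_method_detail")
    )
    return details.most_common(n)

# Hypothetical production sample: 7 standard payments, 3 fallbacks.
records = (
    [{"payment_method": "credit_card", "payment_method_detail": None}] * 7
    + [{"payment_method": "other", "payment_method_detail": "PayPal"},
       {"payment_method": "other", "payment_method_detail": "paypal "},
       {"payment_method": "other", "payment_method_detail": "Apple Pay"}]
)
print(other_rate(records))         # 0.3
print(top_other_details(records))  # [('paypal', 2), ('apple pay', 1)]
```

Here the rate is over the 20% mark and "paypal" dominates the details, which makes it the obvious candidate for the next enum value.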
Pattern 5: Fix Descriptions Before Adding Examples
This is less a separate technique than the meta-pattern that governs when to apply the other four.
When an agent produces incorrect output, the instinct is to add examples. The model miscategorized a ticket? Add an example of the correct categorization. The model fabricated a field? Add an example showing null. The model chose the wrong tool? Add an example of the right tool selection.
This works up to a point. Beyond 4-5 examples, you are compensating for unclear descriptions with brute-force demonstration. The model has enough examples to pattern-match the surface structure, but it still does not understand the underlying rule because the rule was never clearly stated.
The diagnostic sequence:
1. Check the criteria: Are they explicit, categorical, and testable? If “check for accuracy” is the entire instruction, no number of examples will make it consistent.
2. Check the descriptions: Do tool descriptions say WHEN to use the tool, not just WHAT it does? Do field descriptions specify behavior at boundaries (empty results, null values, edge cases)?
3. Check the exclusions: Does the prompt specify what NOT to do? Models over-apply rules when exclusions are not stated.
4. Then add examples: 2-3 examples targeting the specific ambiguities that remain after steps 1-3.
If you reach step 4 and still need 6+ examples, go back to step 1. The criteria are not explicit enough.
This sequence applies to every prompt component in an agent system: system prompts, tool descriptions, evaluation rubrics, and handoff instructions. The descriptions are the specification. The examples are the test cases. You would not write test cases for an unspecified function. Do not write examples for unspecified criteria.
Putting It Together
These five patterns are not independent. They compose into a prompt specification for production agents.
# A production tool definition using all five patterns
CATEGORIZE_TICKET = {
    "name": "categorize_support_ticket",
    "description": (
        # WHEN to use (from Article 1, Tool Design Principles)
        "Use this when a new support ticket arrives and needs routing. "
        "Do not use for tickets already categorized or for internal notes. "
        # Pattern 1: Explicit criteria in the description
        "Categorize by the user's primary GOAL, not the technical symptom. "
        # Pattern 2: One anchoring example for the ambiguous case
        "A user reporting a 500 error on the login page has a goal of "
        "'account_access', not 'technical_bug'."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "ticket_text": {
                "type": "string",
                "description": "The full text of the support ticket."
            },
            "category": {
                "type": "string",
                # Pattern 4: Enum with fallback
                "enum": [
                    "billing", "account_access", "technical_bug",
                    "feature_request", "data_request", "other"
                ],
                "description": (
                    "Primary category. Use 'other' only when the ticket "
                    "genuinely does not fit any defined category."
                )
            },
            "category_detail": {
                # Pattern 3: Nullable for conditional fields
                "type": ["string", "null"],
                "description": (
                    "Required when category is 'other'. Free-text "
                    "description of the actual category. Null otherwise."
                )
            },
            "urgency": {
                "type": "string",
                "enum": ["low", "medium", "high", "critical"],
                "description": (
                    # Pattern 1: Explicit criteria per level
                    "low: general question, no user impact. "
                    "medium: user inconvenienced but can work around. "
                    "high: user blocked from a core workflow. "
                    "critical: data loss, security incident, or outage."
                )
            },
            "extracted_account_id": {
                # Pattern 3: Nullable for extraction
                "type": ["string", "null"],
                "description": (
                    "Account ID if mentioned in the ticket. Null if "
                    "not present. Do not infer from other fields."
                )
            }
        },
        "required": [
            "ticket_text", "category", "category_detail",
            "urgency", "extracted_account_id"
        ]
    }
}
The tool definition above uses every pattern from this article. The description says WHEN to use the tool. The criteria are explicit (categorize by goal, not symptom). The enum has a fallback. Nullable fields prevent fabrication. And the descriptions are specific enough that 2-3 few-shot examples in the system prompt would cover the remaining edge cases.
Connecting to the Series
These patterns directly support the concepts established throughout the Practitioner’s Guide.
From Article 1 (Tool Design Principles): The WHEN-to-use descriptions and verb-noun naming conventions are the foundation. This article adds the schema-level patterns that make tools produce consistent output.
From Article 4 (Structured Error Handling): Error categories and retry logic depend on structured output from tools. If tools produce inconsistent schemas, the error handling cannot classify failures reliably.
From Article 5 (Context Engineering): The quality of tool results entering the context window depends on the quality of the prompt specification. Garbage in, garbage out applies to prompts and schemas, not just data.
From Article 6 (Validation and Review): The explicit criteria pattern from this article is the same pattern used to build eval rubrics. A vague eval criterion produces the same inconsistency as a vague prompt criterion. Both need anchoring examples.
Do Next
| Experience level | Action | Why it matters |
|---|---|---|
| No experience | Take one prompt you use regularly and rewrite it with explicit criteria. Replace any vague instruction (“make it accurate”) with categorical conditions (“flag ONLY when X contradicts Y”). Run both versions on the same input five times and compare consistency. | You will see the inconsistency problem firsthand. The vague prompt produces different outputs across runs. The explicit prompt converges. |
| No experience | Add two few-shot examples to a classification or extraction prompt. Make one example show a straightforward case and one show an edge case. Compare output quality before and after. | The edge case example is where few-shot earns its value. The straightforward example sets the format; the edge case teaches the model your disambiguation rules. |
| Learning | Audit your agent’s tool schemas for forced-choice fields. Find every required field that lacks a nullable type and every enum that lacks an “other” option. Add nullable and fallback where appropriate. Run 20 test cases and measure the hallucination rate before and after. | Forced-choice fields are the single largest source of silent hallucination in production agents. Each nullable field you add is a hallucination you prevent. |
| Learning | Count the few-shot examples in your longest prompt. If any section has more than 5 examples, apply the diagnostic sequence: check criteria first, then descriptions, then exclusions, then reduce to 2-3 targeted examples. | Over-exampled prompts are a symptom of unclear criteria. Fixing the root cause produces better results with fewer tokens. |
| Practitioner | Build a prompt specification template for your team: criteria section, examples section (max 3), schema section with nullable and enum-fallback patterns. Apply it to your next three agent tools. | Consistency across a team’s prompts produces consistency across agent behavior. A shared template prevents each engineer from reinventing the specification wheel. |
| Practitioner | Monitor your “other” rate and null rate in production. If “other” exceeds 20%, expand the enum. If null exceeds 50% for a field, the field may not be extractable from your sources. | Production schemas are living documents. The fallback fields generate the data you need to evolve them. |
This is a companion article in The Practitioner’s Guide to AI Agents.