Observability: Seeing What Your Agent Actually Does
Your monitoring says 200 OK. The agent returned the wrong answer. Traditional APM was designed for deterministic software. Agents reason, branch, and call tools in sequences they decide at runtime. This article covers the five dimensions of agent observability, the tooling landscape, and a practical instrumentation plan.
Part 9 of 12: The Practitioner’s Guide to AI Agents
The Observability Gap
You built the agent. You wrote evals (Article 7). You added guardrails (Article 8). The agent works in testing. Then it hits production.
A user submits a query. The agent reasons for twelve seconds, calls three tools, consumes 47,000 tokens, and returns a confident answer. The answer is wrong. Your monitoring dashboard shows a 200 response with 1.2-second latency. By every traditional metric, everything is fine.
This is the observability gap. Your infrastructure monitoring tells you the system ran. It cannot tell you what the agent did, why it chose that path, or where the reasoning broke down. Traditional application performance monitoring (APM) was designed for deterministic software: a request comes in, code executes the same path every time, a response goes out. Agents do not work that way. They reason, branch, call tools in sequences they decide at runtime, and produce different outputs for identical inputs. Monitoring an agent with standard APM is like monitoring a conversation with a packet sniffer. You see the bytes. You miss the meaning.
I ran into this gap on a data pipeline agent I was evaluating. The agent ran successfully every time. Latency was fine. Error rate was zero. But the outputs were subtly wrong in 15% of runs: it was pulling the right data from the right tables but applying an outdated join condition from an earlier version of the schema. The monitoring said “healthy.” The agent was silently producing incorrect results. It took two weeks of manual output review to find it. With step-level tracing, it would have taken five minutes.
This article covers what to observe, how to observe it, and which tools exist to help. It sits between guardrails and self-improvement because observability is what connects safety architecture to operational reality. Guardrails prevent known failure modes. Observability reveals the ones you did not anticipate.
Why Traditional Monitoring Fails for Agents
The gap between traditional monitoring and agent observability comes down to three properties that agents have and conventional software does not.
Non-determinism. The same input can produce different outputs across runs. A traditional service returns the same result for the same parameters. An agent’s response depends on the model’s sampling temperature, the state of external tools, the contents of retrieved documents, and the accumulated context from prior steps. You cannot write a unit test that asserts on exact output and expect it to pass consistently.
Multi-step reasoning. An agent does not execute a single function. It runs a loop: observe, think, act, repeat. Each iteration builds on the previous one. A failure in step three might not surface until step seven, when corrupted context produces a hallucinated conclusion. Traditional request-response tracing captures the start and end of a call. It does not capture the reasoning chain between them.
Dynamic tool orchestration. The agent decides which tools to call, in what order, with what parameters. This is not a static DAG like a data pipeline. The execution graph is determined at runtime by the model. Two identical user queries might trigger completely different tool sequences depending on what the first tool returns. You need traces that capture the full decision tree, not just the endpoints.
IBM’s analysis of agent observability frames the distinction clearly: traditional monitoring asks “did the system run?” Agent observability asks “did the system reason correctly?” These are fundamentally different questions, and they require fundamentally different instrumentation.
LangChain’s 2025 State of Agent Engineering survey of 1,340 practitioners quantifies the gap: 89% have some form of observability in place, but only 52% run offline evals. Among teams with agents in production, 94% have observability, and 71.5% have detailed step-level tracing. The rest can see that the agent responded. They cannot see how it got there.
Gartner projects that by 2028, 60% of software engineering teams will adopt AI evaluation and observability platforms, up from 18% in 2025. The projection is aggressive, but the direction is not debatable. You cannot operate what you cannot see.
The Five Dimensions of Agent Observability
Traditional monitoring has three pillars: logs, metrics, and traces. Agent observability needs five dimensions that map to the unique failure modes of agentic systems.
1. Execution Tracing
This is the backbone. Every agent run should produce a trace that captures each step in the reasoning loop: the system prompt, the user input, each LLM call and its response, each tool call with its parameters and result, and the final output. The trace should be hierarchical: a parent span for the full agent run, child spans for each reasoning step, and nested spans for tool calls within steps.
Without execution tracing, debugging an agent failure means reading logs and guessing. With it, you can replay the agent's decision path step by step and pinpoint exactly where the reasoning diverged from intent.
The OpenTelemetry GenAI semantic conventions are converging on a standard schema for this. The gen_ai.operation.name attribute uses invoke_agent for agent invocations, with standard attributes for model name, token counts, and tool metadata. The conventions are still in development status, but they represent the direction the industry is moving: agent traces that are interoperable across vendors and tools.
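The hierarchical trace described above can be sketched as plain data. The `gen_ai.*` attribute names below follow the spirit of the in-development conventions, but treat the exact keys and values as illustrative rather than normative:

```python
# A minimal sketch of the hierarchical trace shape: a parent span for
# the agent run, a child span per reasoning step, and nested spans for
# tool calls. Attribute names loosely follow the OpenTelemetry GenAI
# conventions, which are still in development status.
trace = {
    "span": "invoke_agent",
    "attributes": {"gen_ai.operation.name": "invoke_agent",
                   "gen_ai.request.model": "example-model"},
    "children": [
        {
            "span": "reasoning_step",
            "attributes": {"gen_ai.usage.input_tokens": 1200,
                           "gen_ai.usage.output_tokens": 180},
            "children": [
                {"span": "execute_tool",
                 "attributes": {"gen_ai.tool.name": "search_db"}},
            ],
        },
    ],
}

def walk(span, depth=0):
    """Replay the decision path by walking the span tree depth-first."""
    print("  " * depth + span["span"])
    for child in span.get("children", []):
        walk(child, depth + 1)

walk(trace)
```

Any trace viewer worth using renders exactly this tree; the value is that every decision the agent made has an addressable node.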
2. Token Economics
Agents are expensive. A single user request can trigger planning, tool selection, execution, verification, and response generation, easily consuming 3 to 10 times the tokens of a simple chat completion. An unconstrained agent solving a software engineering task can cost $5 to $8 per run in API fees alone.
Token economics monitoring tracks four metrics per agent run:
- Input tokens: how many tokens the agent consumed reading context
- Output tokens: how many tokens the agent generated (typically 4 to 8 times more expensive per token than input)
- Cache hit rate: what percentage of input tokens were served from prompt cache (Anthropic’s prompt caching reduces costs by 90% on cache hits and latency by up to 85%)
- Cost per completion: the dollar amount for each agent run, broken down by step
96% of enterprises report AI costs exceeding initial projections, and only 44% have financial guardrails for AI spending. If you are not tracking token economics per agent run, you are budgeting by hope. The emerging discipline of Agent FinOps recommends three layers of cost governance: per-action limits, per-agent budgets, and fleet-level throttling.
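The three governance layers compose naturally into a single check that runs before each billable action. The sketch below is illustrative: the class name, dollar limits, and exception are assumptions, not any vendor's API:

```python
# A sketch of three-layer cost governance: per-action limits, per-agent
# budgets, and fleet-level throttling. All names and limits here are
# illustrative placeholders.
class BudgetExceeded(Exception):
    pass

class CostGovernor:
    def __init__(self, per_action_usd=0.50, per_agent_usd=10.0,
                 fleet_usd=500.0):
        self.per_action_usd = per_action_usd
        self.per_agent_usd = per_agent_usd
        self.fleet_usd = fleet_usd
        self.agent_spend = {}   # agent_id -> dollars spent so far
        self.fleet_spend = 0.0

    def charge(self, agent_id: str, action_cost_usd: float) -> None:
        """Record one action's cost, enforcing all three layers."""
        if action_cost_usd > self.per_action_usd:
            raise BudgetExceeded(f"action cost ${action_cost_usd:.2f} over limit")
        spent = self.agent_spend.get(agent_id, 0.0) + action_cost_usd
        if spent > self.per_agent_usd:
            raise BudgetExceeded(f"agent {agent_id} over budget")
        if self.fleet_spend + action_cost_usd > self.fleet_usd:
            raise BudgetExceeded("fleet-level throttle triggered")
        self.agent_spend[agent_id] = spent
        self.fleet_spend += action_cost_usd
```

Call `charge()` before dispatching each LLM or tool call; a raised `BudgetExceeded` is the agent's signal to stop rather than loop.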
3. Tool Call Monitoring
Tools are the agent’s interface with the real world. When an agent fails, the root cause is often a tool failure that the agent misinterpreted or worked around incorrectly. Tool call monitoring captures:
- Which tools were called and in what sequence
- Latency per tool call: a slow API response might cause the agent to time out or truncate results
- Success/failure rate: did the tool return valid data, an error, or an empty result?
- Parameter validation: did the agent call the tool with sensible parameters, or did it hallucinate an API endpoint?
The distinction between tool failures and reasoning failures is critical for debugging. A tool failure means the external service broke. A reasoning failure means the agent received valid data and drew the wrong conclusion. These require different fixes: one is an infrastructure problem, the other is a prompt engineering problem. Without tool-level tracing, you cannot tell which one you are dealing with.
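One way to capture all four signals is a thin wrapper around every tool invocation. This is a sketch under the assumption that your tools are plain callables; the record fields are illustrative:

```python
import time

def monitored_tool_call(tool_name, fn, **params):
    """Run one tool call, capturing latency, status, and parameters.

    Records a tool failure (the call raised) separately from an empty
    result and from success. Reasoning failures -- valid data, wrong
    conclusion -- must be judged downstream from the full trace.
    """
    start = time.perf_counter()
    record = {"tool": tool_name, "params": params}
    try:
        result = fn(**params)
        record["status"] = "empty" if result in (None, [], "") else "ok"
        record["result"] = result
    except Exception as exc:
        record["status"] = "tool_failure"
        record["error"] = str(exc)
        record["result"] = None
    record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
    return record
```

Emit each record into your trace as a child span, and the tool-failure-versus-reasoning-failure question becomes a query instead of an investigation.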
4. Context Window Health
Article 6 argued that context is the program. If that is true, then monitoring context window health is monitoring the program’s source code at runtime.
Context window health tracks:
- Utilization percentage: how full is the context window? An agent running at 95% capacity is one more tool response away from truncating critical information.
- Composition breakdown: what proportion of the context is system prompt, conversation history, tool results, and retrieved documents? A context window that is 60% boilerplate and 10% relevant data has an efficiency problem.
- Injection quality: are the documents being retrieved by RAG actually relevant to the query? A context window full of irrelevant chunks produces confident, well-sourced, wrong answers.
Anthropic’s token counting API, which is free to use, lets you measure context size before sending a request. This enables pre-flight checks: if the context exceeds a threshold, compress, summarize, or drop the lowest-priority chunks before making the LLM call.
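A pre-flight check of that kind can be sketched generically. Here `count_tokens` stands in for whatever counter you use (a wrapper around your provider's counting endpoint, or a local tokenizer), and the `(priority, text)` chunk format is an assumption for illustration:

```python
def preflight_trim(chunks, count_tokens, window_limit, target=0.8):
    """Pre-flight context check: if the assembled context would exceed
    target * window_limit tokens, drop lowest-priority chunks first.

    chunks: list of (priority, text) pairs, lower number = more
    important. count_tokens: any callable mapping text -> token count
    (plug in your provider's counting API here -- an assumption, not a
    specific SDK call).
    """
    kept = sorted(chunks, key=lambda c: c[0])   # most important first
    budget = int(window_limit * target)
    while kept and sum(count_tokens(t) for _, t in kept) > budget:
        kept.pop()                               # drop least important
    return [text for _, text in kept]
```

Keeping the target below 100% leaves headroom for the next tool response, which is exactly the truncation scenario the utilization alert is guarding against.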
5. Behavioral Drift Detection
Model providers update models without notice. A January 2026 paper on agent drift introduced the Agent Stability Index (ASI), a composite metric across twelve dimensions including response consistency, tool usage patterns, and reasoning pathway stability. The paper found that semantic drift, defined as progressive deviation from original intent, occurs in nearly half of multi-agent workflows by 600 interactions.
Drift manifests in three forms:
- Semantic drift: the agent’s outputs gradually shift in meaning or tone
- Coordination drift: in multi-agent systems, consensus between agents degrades over time
- Behavioral drift: the agent develops strategies or patterns not present in its original instructions
If you assume “frozen” model versions remain static, drift will surprise you. Developers on r/LLMDevs reported GPT-4o behavioral changes that arrived with zero advance notice. Detection requires baselines: record behavior when the agent works correctly, then continuously compare production behavior against those baselines.
What to Log, What to Alert On
Not everything worth logging is worth alerting on. Over-alerting creates noise that teams learn to ignore. Under-alerting lets failures accumulate silently.
Log everything, alert selectively. Every agent run should produce a full trace: inputs, reasoning steps, tool calls, outputs, token counts, latency. Storage is cheap. Debugging time is not.
Alert on these five signals:
| Signal | Threshold | Why it matters |
|---|---|---|
| Error rate | > 5% of runs in a rolling hour | Tool failures or model errors are spiking |
| Cost per run | > 2x the 7-day rolling average | The agent is in a reasoning loop or calling tools excessively |
| Latency P95 | > 3x the baseline P95 | Slow tool responses or model degradation |
| Context utilization | > 90% of window capacity | One more tool response could truncate critical context |
| Drift score | > 2 standard deviations from baseline | The agent’s behavior has shifted meaningfully |
Do not alert on these: individual tool call failures (tools fail; handle it in the agent’s retry logic), minor token count variations (normal), or single slow responses (transient). Alert on patterns, not incidents.
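The five signals in the table reduce to a handful of comparisons against a metrics snapshot. A minimal sketch, with illustrative field names:

```python
def evaluate_alerts(m):
    """Check the five alert signals against one metrics snapshot.

    `m` is a dict of current values plus baselines; the field names
    here are illustrative, not a standard schema.
    """
    alerts = []
    if m["error_rate_1h"] > 0.05:                       # > 5% of runs
        alerts.append("error_rate")
    if m["cost_per_run"] > 2 * m["cost_7d_avg"]:        # 2x rolling avg
        alerts.append("cost_per_run")
    if m["latency_p95_ms"] > 3 * m["baseline_p95_ms"]:  # 3x baseline P95
        alerts.append("latency_p95")
    if m["context_utilization"] > 0.90:                 # > 90% of window
        alerts.append("context_utilization")
    if abs(m["drift_z"]) > 2:                           # 2 std devs
        alerts.append("drift")
    return alerts
```

Running this on aggregated windows rather than individual runs is what keeps it alerting on patterns, not incidents.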
The Tooling Landscape
The agent observability market has matured rapidly. Here is a vendor-neutral overview of the major categories, current as of early 2026.
The Current Tooling
The market splits into open-source and commercial platforms. On the open-source side: Langfuse (MIT, full-featured tracing + evaluation + prompt management, OpenTelemetry-native), Arize Phoenix (fully open, strongest for local debugging and agent evaluation), and OpenLLMetry (not a platform but an instrumentation layer that sends AI-specific traces to your existing stack: Datadog, Grafana, Jaeger). On the commercial side: LangSmith (low-overhead tracing, best with LangChain ecosystem), Arize AX (PCI DSS compliance, data lake integration), and Braintrust (best-in-class token economics and cost tracking).
The Convergence on OpenTelemetry
The most important trend in agent observability tooling is not any single platform. It is the convergence on OpenTelemetry as the standard telemetry format. The GenAI semantic conventions define standard attributes for model calls, tool invocations, token counts, and agent operations. Amazon, Elastic, Google, IBM, Microsoft, and others are contributing to this specification.
For practitioners: Instrument your agent with OpenTelemetry today, and you can switch observability backends without re-instrumenting your code. This is the same portability promise that OpenTelemetry delivered for traditional distributed tracing, now extended to AI workloads.
Building Your Observability Layer
If you are starting from zero, here is the order of implementation. Each step builds on the previous one.
Week 1: Structured logging. Add structured JSON logging to your agent loop before adopting any platform. Every iteration logs step number, action, token counts, latency, and status:
```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)   # route records to stderr by default
logger = logging.getLogger("agent")

def log_step(run_id: str, step: int, action: str, tokens_in: int,
             tokens_out: int, latency_ms: float, status: str) -> None:
    """Emit one structured JSON record per agent-loop iteration."""
    logger.info(json.dumps({
        "run_id": run_id,
        "step": step,
        "action": action,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": round(latency_ms, 2),
        "status": status,
        "timestamp": time.time(),
    }))
```
Week 2: OpenTelemetry instrumentation. Add OpenTelemetry spans to your agent loop. Use OpenLLMetry or manual instrumentation. Each agent run becomes a parent span. Each reasoning step and tool call becomes a child span. Export to whatever backend you already use for tracing (Jaeger, Grafana Tempo, Datadog) or start with the OpenTelemetry Collector writing to local files.
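As a dependency-free illustration of what manual instrumentation involves, the sketch below builds a parent/child span tree with context variables. In practice the OpenTelemetry SDK's `tracer.start_as_current_span` provides this machinery, plus exporters; everything here is a hand-rolled stand-in:

```python
import contextlib
import contextvars

# Tracks the currently open span, so nested `with` blocks attach
# children automatically -- the same trick the OTel SDK uses.
_current = contextvars.ContextVar("current_span", default=None)

class Span:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children, self.attributes = [], {}

@contextlib.contextmanager
def start_span(name):
    """Open a child span under the current span (manual sketch only)."""
    parent = _current.get()
    span = Span(name, parent)
    if parent is not None:
        parent.children.append(span)
    token = _current.set(span)
    try:
        yield span
    finally:
        _current.reset(token)

# One agent run becomes a parent span; each reasoning step and each
# tool call nests inside it.
with start_span("invoke_agent") as run:
    with start_span("reasoning_step"):
        with start_span("execute_tool") as tool:
            tool.attributes["gen_ai.tool.name"] = "search_db"
```

Once you swap in the real SDK, the span tree is identical, but it flows to whatever backend your exporter targets.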
Week 3: Cost tracking. Parse token counts from every LLM response and calculate cost per run using your provider’s pricing. Build a daily rollup: total spend, average cost per run, top-10 most expensive runs. Anthropic’s response objects include usage.input_tokens, usage.output_tokens, and usage.cache_read_input_tokens. Multiply by the per-token price for your model tier.
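The rollup itself is a few lines once the usage fields are parsed. The per-million-token prices below are illustrative placeholders, not a published price sheet; substitute your model tier's actual rates:

```python
# Daily cost rollup sketch. PRICE_* values are illustrative only --
# replace them with your provider's current per-million-token rates.
PRICE_IN, PRICE_OUT, PRICE_CACHE_READ = 3.00, 15.00, 0.30  # $/M tokens

def run_cost(usage):
    """Dollar cost of one run from a usage dict with input_tokens,
    output_tokens, and cache_read_input_tokens fields."""
    return (usage.get("input_tokens", 0) * PRICE_IN
            + usage.get("output_tokens", 0) * PRICE_OUT
            + usage.get("cache_read_input_tokens", 0) * PRICE_CACHE_READ
            ) / 1_000_000

def daily_rollup(runs):
    """Total spend, average cost per run, and the top-10 most
    expensive runs for one day's worth of usage records."""
    costs = sorted((run_cost(u) for u in runs), reverse=True)
    return {"total": round(sum(costs), 4),
            "avg": round(sum(costs) / len(costs), 4),
            "top10": [round(c, 4) for c in costs[:10]]}
```

The top-10 list is usually where reasoning loops show up first: a handful of runs accounting for a disproportionate share of the daily total.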
Week 4: Dashboards and alerts. Build three dashboards:
- Agent health: error rate, P50/P95 latency, runs per hour, success rate
- Token economics: cost per run (trend), daily spend, cache hit rate, cost by agent type
- Behavioral baseline: average tool calls per run, average steps per run, token consumption distribution
Set up alerts using the five signals from the table above. Start with generous thresholds and tighten them as you learn what normal looks like for your agents.
Month 2: Drift detection. After you have four weeks of baseline data, implement drift monitoring. Compare the current week’s metrics against the four-week rolling average. Flag runs that deviate by more than two standard deviations on any key metric. This does not require a dedicated drift detection tool. A scheduled script that queries your trace data and computes z-scores is sufficient to start.
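The scheduled script amounts to a z-score per metric. A minimal sketch, assuming you can query per-metric history out of your trace store:

```python
import statistics

def drift_flags(baseline, current, threshold=2.0):
    """Flag metrics whose current value deviates more than `threshold`
    standard deviations from the baseline window.

    baseline: metric name -> list of historical values (e.g. four
    weekly averages). current: metric name -> this week's value.
    """
    flags = {}
    for name, history in baseline.items():
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            continue  # no variation in baseline; z-score undefined
        z = (current[name] - mean) / stdev
        if abs(z) > threshold:
            flags[name] = round(z, 2)
    return flags
```

Run it weekly over metrics like tool calls per run, steps per run, and token consumption; a non-empty result is your drift alert.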
The Connection to Evals and Guardrails
Observability does not replace evals or guardrails. It completes the feedback loop.
Evals (Article 7) tell you whether the agent’s output is correct. They run offline against test cases and catch quality regressions before deployment.
Guardrails (Article 8) prevent known failure modes in real time. They validate inputs, constrain reasoning, and filter outputs.
Observability tells you what is actually happening in production. It catches the failures that evals did not anticipate and guardrails did not prevent. It generates the data that makes your next round of evals and guardrails more targeted.
The virtuous cycle works like this: observability surfaces a new failure pattern. You write an eval that catches it. If the failure is preventable, you add a guardrail. The eval verifies the guardrail works. Observability confirms it holds in production. Repeat.
Teams that adopt comprehensive evaluation and observability together achieve 2.2 times better reliability than teams that rely on one without the other. That number is not surprising. You cannot improve what you cannot see, and you cannot verify improvements without measurement.
Do Next
| Priority | Action | Why it matters |
|---|---|---|
| No experience | Add structured JSON logging to one agent’s reasoning loop. Log step number, action, token count, and latency for each iteration. | You cannot debug agent failures from HTTP status codes. Structured logs give you the minimum viable observability for free. |
| No experience | Calculate the cost of your last 100 agent runs by multiplying token counts by your provider’s per-token price. | Most teams do not know what their agents cost. The number is usually higher than expected. |
| Learning | Instrument your agent with OpenTelemetry using OpenLLMetry or your framework’s native integration. Export traces to a local backend. | Traces let you replay an agent’s full decision path. This is the single most valuable debugging tool for agentic systems. |
| Learning | Set up Langfuse or Arize Phoenix and route your agent traces to it. Explore the trace viewer for five failed runs. | Seeing step-by-step reasoning in a visual trace viewer changes how you think about agent debugging. Start with failures; they teach you the most. |
| Practitioner | Build the three dashboards described in this article (agent health, token economics, behavioral baseline) and set up the five alerting signals. | Dashboards turn observability data into operational awareness. Alerts turn awareness into response time. |
| Practitioner | After four weeks of baseline data, implement drift detection by comparing weekly metrics against rolling averages. Flag two-standard-deviation deviations. | Model providers change behavior without notice. Drift detection is your early warning system for silent degradation. |
This is Part 9 of 12 in The Practitioner’s Guide to AI Agents. ← Previous: Guardrails and Safety · Next: Multi-Agent Systems →
Sources & References
- IBM: AI Agent Observability (2026)
- LangChain: 2025 State of Agent Engineering Survey (2025)
- Gartner: Explainable AI Will Drive LLM Observability Investments (2026)
- OpenTelemetry GenAI Semantic Conventions (2025)
- Zylos Research: AI Agent Cost Optimization and Token Economics (2026)
- Cordum: Agent FinOps and Token Cost Governance (2026)
- Agent Drift Study: Agent Stability Index (2026)
- Braintrust: Best AI Observability Tools 2026 (2026)
- Langfuse: Open-Source LLM Observability (2025)
- Arize Phoenix: Open-Source AI Observability (2025)
- OpenLLMetry: OpenTelemetry for AI (2025)
- Anthropic: Prompt Caching (2025)
- Anthropic: Token Counting API (2025)