Design

Pattern 25 of 26

Observability and Tracing

A system you cannot observe is a system you cannot fix

An agent ran 14 steps, returned the wrong answer, and I had no idea where it went wrong. That is the problem observability solves. Traces capture every tool call, reasoning step, and decision point so you can replay exactly what happened. Without that, debugging is just reading the final output and guessing at the middle.

Why it matters

Agents are non-deterministic and multi-step. The same input can produce different paths on different runs. You cannot reproduce failures from the output alone. If you cannot trace what the agent actually did, you are not debugging, you are guessing at a system you do not understand.

Deep Dive

The failure mode I hit most often is not a crash. It is a subtly wrong result that the agent produced confidently. To find where the reasoning broke down I need to see the full chain: what the agent was thinking at each step, which tool it called and why, what that tool returned, and how the agent interpreted that return before moving on. A log that says "tool returned 200" does not give me that. A structured trace span that captures arguments in, data out, and the reasoning step that followed does. That distinction is the whole game.
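To make that concrete, here is a minimal sketch of what "arguments in, data out, and the reasoning step that followed" looks like as a span object. The class and field names are illustrative, not any vendor's schema:

```python
from dataclasses import dataclass, field
import time

@dataclass
class ToolSpan:
    """One traced tool call: what the agent sent, what came back,
    and what the agent concluded before moving on."""
    tool_name: str
    arguments: dict
    output: object = None
    reasoning_after: str = ""
    started_at: float = field(default_factory=time.time)
    ended_at: float = 0.0

    def finish(self, output, reasoning_after):
        self.output = output
        self.reasoning_after = reasoning_after
        self.ended_at = time.time()

# A bare log line would only say "tool returned 200"; the span
# keeps the full context needed to replay the decision.
span = ToolSpan(tool_name="search_flights",
                arguments={"origin": "SFO", "dest": "JFK"})
span.finish(output={"status": 200, "results": 3},
            reasoning_after="Three options found; picking the cheapest.")
```

When the final answer is wrong, `reasoning_after` is usually where you find the step that went sideways, not the HTTP status.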

OpenTelemetry extended its semantic conventions for generative AI in 2024, which gave the ecosystem a common schema for LLM spans: model name, prompt, completion, token counts, latency, tool calls, and retrieval steps. Langfuse, Arize Phoenix, and Braintrust all build on this foundation. Langfuse is open source and self-hostable, which matters if you want to keep trace data on your own infrastructure. Phoenix is built on the full OpenTelemetry stack. Braintrust connects traces directly to eval workflows, so you can go from "this step looks wrong" to "here is a structured test for it" without switching tools.
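For a sense of what that common schema standardizes, here are the kinds of attributes the GenAI conventions define for an LLM span. The attribute names below follow the convention as I understand it, but the spec is still marked experimental, so verify against the current OpenTelemetry docs before relying on exact names:

```python
# Representative gen_ai.* attributes attached to a single LLM span.
# Values are made up for illustration.
llm_span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 412,
    "gen_ai.usage.output_tokens": 88,
}
```

Because every vendor reads the same keys, a trace emitted once can be shipped to Langfuse, Phoenix, or Braintrust without re-instrumenting.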

The thing that took me a while to understand is that semantic structure matters more than volume. You can log everything and still know nothing if the data is not organized around the decisions the agent was making. A well-structured trace is organized by reasoning step, not by function call. It shows the agent's intent, the tool it chose to fulfill that intent, and the consequence of that choice. The frameworks getting traction are the ones that make this easy to emit without wiring up every method call by hand. Instrumenting agent code manually is a way to burn a week and still end up with incomplete traces.
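A sketch of what "organized by reasoning step, not by function call" means in practice. Each entry ties an intent to the tool chosen for it and the consequence the agent drew; the scenario and helper are entirely illustrative:

```python
# One trace, one list entry per reasoning step.
trace = [
    {
        "step": 1,
        "intent": "Find the user's current plan tier",
        "tool": {"name": "crm_lookup", "args": {"user_id": "u_42"}},
        "consequence": "User is on the Free tier; upgrade path applies.",
    },
    {
        "step": 2,
        "intent": "Check whether the requested feature is gated",
        "tool": {"name": "feature_flags", "args": {"flag": "bulk_export"}},
        "consequence": "Gated behind Pro; answer must mention upgrade.",
    },
]

def replay(trace):
    """Render the trace as the intent -> tool -> consequence chain
    a human would read when hunting for the broken step."""
    lines = []
    for s in trace:
        lines.append(f"step {s['step']}: intent={s['intent']!r}")
        lines.append(f"  tool={s['tool']['name']} args={s['tool']['args']}")
        lines.append(f"  => {s['consequence']}")
    return "\n".join(lines)
```

Reading this top to bottom answers "why did the agent say that" directly; a flat log of function calls would force you to reconstruct the intent yourself.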

In the Wild

Langfuse (open-source, self-hostable)
Braintrust (1M spans/month free)
Arize Phoenix (OpenTelemetry)
Cloudflare AI Gateway

Go Deeper

DOCS: OpenTelemetry Semantic Conventions for Generative AI
ARTICLE: An Introduction to Observability for LLM-based Applications
ARTICLE: AI Agent Observability, Tracing and Evaluation with Langfuse
DOCS: Cloudflare AI Gateway
