Pattern 25 of 26
Observability and Tracing
A system you cannot observe is a system you cannot fix
An agent ran 14 steps, returned the wrong answer, and I had no idea where it went wrong. That is the problem observability solves. Traces capture every tool call, reasoning step, and decision point so you can replay exactly what happened. Without that, debugging is just reading the final output and guessing at the middle.
Why it matters
Agents are non-deterministic and multi-step. The same input can produce different paths on different runs, so you cannot reproduce failures from the output alone. If you cannot trace what the agent actually did, you are not debugging; you are guessing at a system you do not understand.
Deep Dive
The failure mode I hit most often is not a crash. It is a subtly wrong result that the agent produced confidently. To find where the reasoning broke down I need to see the full chain: what the agent was thinking at each step, which tool it called and why, what that tool returned, and how the agent interpreted that return before moving on. A log that says "tool returned 200" does not give me that. A structured trace span that captures arguments in, data out, and the reasoning step that followed does. That distinction is the whole game.
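Here is a minimal sketch of that distinction in plain Python. The names (ToolSpan, Trace, search_flights) are illustrative, not from any real tracing library; the point is what fields a useful span carries that a "tool returned 200" log line does not.

```python
from dataclasses import dataclass, field
import time

@dataclass
class ToolSpan:
    tool: str         # which tool the agent chose
    arguments: dict   # arguments in
    result: object    # data out
    reasoning: str    # how the agent interpreted the result before moving on
    started_at: float = field(default_factory=time.time)

@dataclass
class Trace:
    spans: list = field(default_factory=list)

    def record(self, span: ToolSpan) -> None:
        self.spans.append(span)

trace = Trace()
trace.record(ToolSpan(
    tool="search_flights",
    arguments={"origin": "SFO", "destination": "JFK"},
    result={"status": 200, "flights": 3},
    reasoning="Three options returned; filtering to nonstop before pricing.",
))

# A bare log would only say "tool returned 200". The span preserves the
# arguments, the payload, and the reasoning step that followed.
```

Replaying a failed run is then just walking the span list in order, reading intent against outcome at each step.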
OpenTelemetry extended its semantic conventions for generative AI in 2024, which gave the ecosystem a common schema for LLM spans. Model name, prompt, completion, token counts, latency, tool calls, retrieval steps. Langfuse, Arize Phoenix, and Braintrust all build on this foundation. Langfuse is open source and self-hostable, which matters if you want to keep trace data on your own infrastructure. Phoenix is built on the full OpenTelemetry stack. Braintrust connects traces directly to eval workflows, so you can go from "this step looks wrong" to "here is a structured test for it" without switching tools.
The thing that took me a while to understand is that semantic structure matters more than volume. You can log everything and still know nothing if the data is not organized around the decisions the agent was making. A well-structured trace is organized by reasoning step, not by function call. It shows the agent's intent, the tool it chose to fulfill that intent, and the consequence of that choice. The frameworks getting traction are the ones that make this easy to emit without wiring up every method call by hand. Instrumenting agent code manually is a way to burn a week and still end up with incomplete traces.
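A sketch of what "organized by reasoning step" means in practice, using an invented invoice-lookup step. Every name here is hypothetical; the shape is what matters: intent, then the tool chosen to fulfill it, then the consequence the agent drew.

```python
# One reasoning step as the unit of the trace, not one function call.
step = {
    "intent": "Find the customer's next invoice due date",
    "tool_call": {
        "name": "query_invoices",
        "arguments": {"customer_id": "c_42", "status": "open"},
        "result": [{"id": "inv_7", "due": "2025-07-01"}],
    },
    "consequence": "One open invoice; answering with its due date.",
}

def render_step(s: dict) -> str:
    # Collapse a step to the line you actually read when debugging:
    # what the agent wanted -> what it did -> what it concluded.
    return f'{s["intent"]} -> {s["tool_call"]["name"]} -> {s["consequence"]}'
```

A trace built from units like this reads as a narrative of decisions, which is exactly what you need when the failure is a confident wrong answer rather than a crash.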