Design
Pattern 26 of 26
Evaluation and Testing
How do you know your agent is good?
I upgraded a model once and a capability I relied on quietly got worse. I did not find out for a week. Evals are what prevent that. Agent evals are harder than measuring benchmark accuracy because you need to test tool selection, step ordering, failure recovery, and end-to-end outcomes across multi-step tasks. A simple right-or-wrong score does not capture any of that.
Why it matters
Silent regressions are worse than loud failures. A crash surfaces immediately. An agent that produces subtly wrong results with full confidence can run for weeks before someone catches it. Evals make the regression visible at the moment you introduce the change, not after it reaches users.
Deep Dive
Evaluating an agent is not the same as evaluating a model. A benchmark that tests whether a model knows the capital of France tells you nothing about whether an agent can choose the right tool, call it with the right arguments, interpret the result, and proceed correctly when the tool returns an error. Those are behavioral properties, not knowledge properties. They need different test harnesses, different graders, and different thinking about what a passing result even means. The vocabulary matters too: task, trial, grader, transcript. Anthropic introduced this frame in their work on demystifying evals and it is useful for keeping eval infrastructure honest.
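That vocabulary translates directly into eval infrastructure. Here is a minimal sketch in Python of what those four pieces can look like for a behavioral grader that checks tool selection rather than knowledge. The class names mirror the task/trial/grader/transcript terms above; the `expected_tool` field and `grade_tool_selection` function are illustrative assumptions, not a standard API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One eval task: the input plus what a correct behavior looks like."""
    prompt: str
    expected_tool: str  # hypothetical: the tool a competent agent should reach for

@dataclass
class Transcript:
    """What the agent actually did: a list of (tool_name, args, result) steps."""
    steps: list

@dataclass
class Trial:
    """One run of one task, paired with the transcript it produced."""
    task: Task
    transcript: Transcript

def grade_tool_selection(trial: Trial) -> bool:
    """Behavioral grader: did the agent call the expected tool at any step?

    Note this inspects the transcript, not the final answer. An agent can
    produce the right answer for the wrong reason; a transcript grader
    catches that.
    """
    return any(tool == trial.task.expected_tool
               for tool, _args, _result in trial.transcript.steps)
```

A grader like this is deliberately narrow. In practice you layer several: one for tool selection, one for argument correctness, one for the end-to-end outcome, each scoring the same transcript.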
SWE-bench became the standard benchmark for coding agents because it used real GitHub issues as tasks. Fix this bug. Implement this feature. The test suite is the grader. That design sidesteps the problem of writing artificial tasks and then arguing about what the right answer is. SWE-bench Verified, released in August 2024, is a 500-problem subset where every issue was reviewed by expert engineers to confirm it is actually solvable and unambiguous. That review step matters more than it sounds. A lot of eval infrastructure fails not because the grader is wrong but because the tasks themselves have no clean answer.
The thing I got wrong early on was treating evals as something to set up after the agent was working. That is backwards. Every prompt change, every model upgrade, every new tool you add can improve one behavior and quietly break another. You do not know which until you run the eval suite. The overhead is real. Wiring up a proper grader, assembling representative tasks, and getting deterministic results from a non-deterministic system all take time. But the alternative is finding out about regressions from users, which is a worse place to be.
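One concrete way to get stable numbers out of a non-deterministic system is to run each task several times and track the pass rate rather than a single pass/fail bit. A sketch, with a hypothetical `run_trial` callable standing in for whatever executes one agent run and grades it:

```python
def pass_rate(run_trial, task, n_trials: int = 5) -> float:
    """Run the same task n_trials times; return the fraction that pass.

    run_trial(task) -> bool is assumed to execute one agent run and
    grade it. Comparing pass rates before and after a prompt or model
    change is far more trustworthy than comparing two single runs,
    because a flaky task will flip-flop on its own.
    """
    passes = sum(1 for _ in range(n_trials) if run_trial(task))
    return passes / n_trials
```

A task that passes 3 of 5 trials before a change and 3 of 5 after probably did not regress; a task that drops from 5 of 5 to 1 of 5 almost certainly did. Single-run comparisons cannot tell those two cases apart.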