Infrastructure

Pattern 17 of 26

Agent Harnesses

The scaffolding between a model call and a working product

A harness is the scaffolding that sits between a raw model call and a working agent in production. It handles the tool-call loop, retry logic, timeout handling, state checkpointing, and crash recovery. Without it, a 2% per-call failure rate compounds across 50 calls into something that fails constantly. With it, you have a system that can survive network errors, resume interrupted runs, and actually be trusted in production.

Why it matters

I have spent more time debugging harness issues than model issues. The model mostly does what it is supposed to. The harness is where tasks get dropped, state gets lost, and timeouts cause silent failures. The model is one component. The harness is what determines whether the whole thing works reliably.

Deep Dive

A harness is the scaffolding around a model call that makes it production-worthy. A raw model call that fails 2% of the time is not a minor issue in a workflow that makes 50 model calls per task. That 2% compounds: the chance of a clean run is 0.98^50, roughly 36%, so nearly two-thirds of tasks hit at least one failure. A harness with exponential backoff retry turns it into near-zero. A harness with state checkpointing means that when a network error interrupts a 40-step run at step 35, the next invocation resumes from step 35 rather than starting over. These are not engineering luxuries. They are what make the difference between a demo and something you can trust with real work.
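The retry half of that claim can be sketched in a few lines. This is a minimal, framework-free example, not any particular SDK's implementation; the function names and defaults are assumptions:

```python
import random
import time

def call_with_retry(fn, max_attempts=5, base_delay=1.0):
    """Retry a flaky call with exponential backoff plus jitter.

    A call that fails 2% of the time, retried up to 5 times, only
    fails overall when every attempt fails: 0.02**5, about 3e-9.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the real error
            # Sleep ~1s, 2s, 4s, ... with jitter to avoid retry storms.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

In production you would catch only transient error types (timeouts, rate limits, 5xx responses) rather than bare `Exception`, so that genuine bugs still fail fast.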

Anthropic's November 2025 post on effective harnesses for long-running agents describes a specific architecture worth understanding. There is an INITIALIZER AGENT that runs once at the start and creates durable artifacts on the filesystem: a feature list, a progress notes file, an init script. There is a CODING AGENT that runs each session, reads those artifacts to reconstruct its state, implements one feature, and writes clean progress notes before exiting. Session context lives in files, not in conversation history. This is why the pattern works across arbitrary interruptions and arbitrary numbers of sessions. The agent does not need to remember. It reads.
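The shape of that pattern can be sketched as follows. This is an illustrative reconstruction, not Anthropic's code: the directory name, file names, and the `implement` hook are all assumptions; the point is that each session reconstructs its state from files rather than from conversation history.

```python
import json
from pathlib import Path

WORKDIR = Path("agent_state")  # hypothetical durable-artifact directory

def initialize(features):
    """Initializer agent: runs once, writes durable artifacts to disk."""
    WORKDIR.mkdir(exist_ok=True)
    (WORKDIR / "features.json").write_text(json.dumps(features))
    (WORKDIR / "progress.md").write_text("# Progress\n")

def run_session(implement):
    """Coding agent: read the artifacts, do one feature, record it, exit."""
    features = json.loads((WORKDIR / "features.json").read_text())
    progress = (WORKDIR / "progress.md").read_text()
    done = {line[2:] for line in progress.splitlines() if line.startswith("- ")}
    remaining = [f for f in features if f not in done]
    if not remaining:
        return None  # every feature is already checked off
    feature = remaining[0]
    implement(feature)  # the model-driven work happens here
    with (WORKDIR / "progress.md").open("a") as f:
        f.write(f"- {feature}\n")  # clean progress note before exiting
    return feature
```

Because `run_session` derives everything it needs from the filesystem, it is safe to kill and restart at session boundaries arbitrarily many times.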

LangGraph's checkpointer system and Temporal for AI workflows solve the same durability problem from different layers of the stack. LangGraph saves the full graph state at each node execution, enabling resume from any specific point in an agentic workflow. Temporal provides durability at the infrastructure level. Code executes exactly once even if the worker crashes mid-execution, guaranteed by a write-ahead log of workflow state. The choice between them comes down to where you want the recovery logic to live. LangGraph recovery is application-level and visible in your code. Temporal recovery is infrastructure-level and largely invisible, which is either reassuring or opaque depending on how much you like having control.
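The application-level flavor of that recovery logic can be sketched without any framework: persist the step index and accumulated state after each node, and resume from the last checkpoint on the next invocation. The checkpoint file name and step functions here are hypothetical, and this stands in for what LangGraph's checkpointer does per node, not its actual API:

```python
import json
from pathlib import Path

CHECKPOINT = Path("run_checkpoint.json")  # hypothetical checkpoint file

def run_workflow(steps, state=None):
    """Execute steps in order, checkpointing state after each one.

    If a previous run was interrupted, resume from the saved step
    index instead of starting over.
    """
    start = 0
    if CHECKPOINT.exists():
        saved = json.loads(CHECKPOINT.read_text())
        start, state = saved["step"], saved["state"]
    for i in range(start, len(steps)):
        state = steps[i](state)
        CHECKPOINT.write_text(json.dumps({"step": i + 1, "state": state}))
    CHECKPOINT.unlink()  # clean up on successful completion
    return state
```

Note the trade-off the Deep Dive describes: here the recovery logic is explicit in your code, whereas Temporal moves the same write-ahead bookkeeping into the infrastructure so your workflow code never sees it.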

In the Wild

Anthropic Long-Running Agent Harness
OpenAI Agents SDK Runner
LangGraph Checkpointer
Temporal for AI Agents

Go Deeper

GUIDE: Effective Harnesses for Long-Running Agents
GUIDE: Building Effective Agents
DOCS: LangGraph Persistence and Checkpointing
ARTICLE: Durable Execution Meets AI: Why Temporal is the Perfect Foundation for AI Agents

Related Patterns