Field Guide

AgentX: 26 AI Agent Patterns

I built this page because I kept losing track of which pattern solved which problem. The agent ecosystem moves fast and the vocabulary is still settling. Having a single reference that covers the full stack, from ReAct loops to sandboxes to eval frameworks, has been useful for me personally when scoping new work.

These 26 patterns cover the five layers I think about when building production agents: foundations (the primitives), patterns (the architectures), infrastructure (the plumbing), surfaces (where users touch the system), and design (the practices that make it trustworthy). You can build a working agent without understanding all of them. You cannot build a reliable one without most of them.

The descriptions are written from direct experience, not from reading documentation. Some of the opinions are strong. Treat the examples as a starting point, not an endorsement.

01

Foundations

The mental models you need before building anything

Before you pick a framework or wire up your first tool, you need to understand how agents actually work at the loop level. Getting the primitives wrong means debugging the wrong things later.

Tool Use and Function Calling

01

How a model stops talking and starts doing

I think about tool use as the moment a language model stops being a text generator and starts being something you can actually give work to. The model outputs structured JSON. Your code reads that JSON and does something real: calls an API, writes a file, queries a database. That is the whole trick. MCP, computer use, all of it sits on top of this one primitive. If you understand this, you understand why agents can do things at all.
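Stripped to its essentials, the primitive is very small. This is a minimal, provider-agnostic sketch — the `model_output` dict stands in for what a real API (Anthropic tool use, OpenAI function calling) returns, and `get_weather` is a hypothetical stub:

```python
# Hypothetical model output: real APIs return an equivalent structure —
# a tool name plus JSON arguments the model chose.
model_output = {"tool": "get_weather", "arguments": {"city": "Berlin"}}

# The tool registry: plain functions your code chooses to expose.
def get_weather(city: str) -> str:
    # Stub; a real tool would call a weather API here.
    return f"18C and cloudy in {city}"

TOOLS = {"get_weather": get_weather}

def execute_tool_call(output: dict) -> str:
    """The whole trick: read the model's JSON, run real code, return the result."""
    fn = TOOLS[output["tool"]]
    return fn(**output["arguments"])

result = execute_tool_call(model_output)
# `result` goes back to the model as the tool's response, and the loop continues.
```

Everything above this primitive — MCP, computer use, agent frameworks — is variations on choosing, validating, and sequencing these calls.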

Why it matters. Every agent capability you will ever build traces back here. A model without tool use is genuinely useful but it is not an agent. The gap between "can generate text about doing X" and "can actually do X" is exactly this primitive. That gap is worth understanding before you build anything on top of it.

Anthropic Tool Use API · Instructor (3M+ monthly downloads) · PydanticAI · Cloudflare Agents SDK
Deep dive →

ReAct Pattern

02

Most agent bugs live in one of three places

ReAct is the loop that most agents are actually running, even when the framework does not call it that. The model reasons about what to do, calls a tool, reads the result, then reasons again. That cycle repeats until the task is done or something breaks. Once I understood this I stopped debugging agent failures as mysteries and started treating them as broken steps in a sequence. That shift matters a lot.
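The loop is simple enough to sketch in a few lines. Here `model` is a scripted stub standing in for an LLM call that returns a thought plus a chosen action; the names and transcript format are illustrative, not any framework's API:

```python
# A minimal ReAct loop: Thought -> Action -> Observation, repeated.
def react_loop(task, model, tools, max_steps=10):
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model(transcript)                            # Thought + Action choice
        transcript.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["answer"], transcript
        observation = tools[step["action"]](step["input"])  # Action
        transcript.append(f"Observation: {observation}")    # Observation
    raise RuntimeError("max steps exceeded")

# Scripted stub model: looks something up, then finishes.
script = iter([
    {"thought": "I need the population.", "action": "lookup", "input": "Berlin"},
    {"thought": "I have it now.", "action": "finish", "answer": "3.8M"},
])
answer, trace = react_loop(
    "Population of Berlin?",
    model=lambda transcript: next(script),
    tools={"lookup": lambda q: "Berlin: 3.8M people"},
)
```

The transcript is the debugging artifact: when a run goes wrong, it went wrong at a specific entry in that list.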

Why it matters. When a Claude Code session goes sideways, it went wrong at a specific Thought, Action, or Observation. Knowing there is a loop to inspect changes how you approach debugging entirely. It is the difference between "the agent failed" and "the agent reasoned wrong after this specific tool call." The second one you can actually fix.

LangGraph · Agno (formerly Phidata) · Gemini CLI
Deep dive →

Planning and Decomposition

03

Agents that skip planning usually fail twice

An agent without a plan is just pattern-matching its way forward, one tool call at a time. Planning means breaking a complex goal into an ordered set of subtasks, working out what depends on what, and building that structure before anything actually runs. The plan does not need to be perfect. It needs to exist. Agents that skip this fail quietly on tasks with three or more moving parts, and the failure is hard to diagnose because no single step looks wrong.
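The "what depends on what" part is just a dependency graph. A sketch using the standard library's topological sorter — the subtask names are invented for illustration; in a real agent the model would emit this structure:

```python
from graphlib import TopologicalSorter

# A plan: each subtask maps to the set of subtasks it depends on.
plan = {
    "design_api":  set(),
    "implement":   {"design_api"},
    "write_tests": {"implement"},
    "update_docs": {"implement"},
}

# static_order() yields subtasks so every dependency runs first.
order = list(TopologicalSorter(plan).static_order())
```

Executing in `order` guarantees no step runs before its prerequisites — which is exactly the structure that agents skipping the planning step fail to build.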

Why it matters. Planning is what makes agents usable on real work. It is also where the hardest UX problems live. A wrong plan that runs autonomously for ten minutes before you realize it is wrong is an expensive mistake, and the cost is not just compute. It is trust. Getting planning right is what lets you give an agent actual responsibility.

CrewAI · Devin 2.0 · OpenAI Agents SDK agent-as-tool
Deep dive →

Reflection and Self-Correction

04

The second pass that catches what the first missed

Reflection is the agent stopping to look at what it just produced before moving forward. Did the code actually compile? Does the answer address what was asked, or a slightly different question the model found easier? Self-correction loops catch a real percentage of errors before they reach anyone. The tradeoff is extra latency and cost, which is a real engineering decision you have to make deliberately rather than defaulting to either extreme.
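The shape of the tradeoff is visible in a sketch: each revision costs one extra `critique` call and one extra `generate` call. Both functions here are toy stubs standing in for model calls:

```python
# A minimal reflect-and-revise loop with a bounded revision budget.
def with_reflection(task, generate, critique, max_revisions=2):
    draft = generate(task, feedback=None)
    for _ in range(max_revisions):
        problems = critique(task, draft)   # e.g. "does it compile? right question?"
        if not problems:
            return draft
        draft = generate(task, feedback=problems)
    return draft  # budget exhausted: return best effort

# Toy stubs: the first draft is missing a unit; the critic catches it.
def generate(task, feedback):
    return "42 km" if feedback else "42"

def critique(task, draft):
    return [] if draft.endswith("km") else ["missing unit"]

final = with_reflection("How far is it?", generate, critique)
```

The `max_revisions` bound is the deliberate engineering decision: it caps the latency and cost you pay for the second pass.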

Why it matters. Without reflection, a model that gets something wrong will often keep building on that wrong foundation. The confidence of the output does not correlate with its correctness. A bad answer delivered with certainty is not neutral, it is actively harmful because the person on the other end trusts it more than they should. Reflection is what closes that gap.

LangGraph Reflection Agents · DSPy · Reflexion
Deep dive →
02

Patterns

Concrete architectures you will implement

These are battle-tested blueprints, not abstractions. Each one solves a real engineering problem that most agent builders hit within the first month of production work.

Model Context Protocol (MCP)

05

Write the tool once, use it everywhere

Before MCP, every agent application built its own integrations. You wanted your tool to work in Claude Code and also Cursor? Write it twice. MCP is the standard that changed that: one server, any compatible client. It is now under the Linux Foundation, which means it is no longer Anthropic's spec to change unilaterally. The ecosystem is real: 5,800+ community servers and 97 million monthly SDK downloads.

Why it matters. I use Claude Code daily and I connect MCP servers to it constantly. The productivity gain is not theoretical. You stop writing the same GitHub integration over and over and start actually building the thing you care about. That reduction in integration overhead is real money saved per agent project.

Claude Code · Cursor · VS Code Copilot · Gemini CLI · Cloudflare Codemode
Deep dive →

Computer Use Agents (CUA)

06

If a human can click it, an agent can reach it

Most software in the world does not have an API. Computer use is the pattern that solves that. The model takes a screenshot, figures out what it is looking at, then outputs clicks and keystrokes like a human would. Anthropic launched it in October 2024. OpenAI followed in January 2025. It is slower and less reliable than API-based tool use, but for legacy systems and anything behind a GUI it is often the only option you have.

Why it matters. I think of this as the long tail of automation. The well-maintained SaaS tools have APIs. Everything else, the internal dashboards, the government portals, the decade-old desktop software that the operations team depends on, does not. Computer use is the only way to reach those. You pay for it in latency and occasional breakage, but sometimes that trade is worth making.

Browser Use (58k+ GitHub stars) · Anthropic Computer Use · Cloudflare Browser Rendering
Deep dive →

Skills

07

Load what you need, not everything you know

A skill is a reusable bundle of instructions, tools, and domain knowledge that an agent loads at runtime based on what it is being asked to do. Instead of one giant system prompt that tries to cover every situation, you compose focused capability modules. The SKILL.md convention gives each skill a discoverable interface that lives in the filesystem alongside your code. Versioned, reviewable, shareable.

Why it matters. I have a /review-pr skill and a /security-audit skill that I have tuned over months. When I type them in Claude Code, I get consistent behavior every time. Without skills, I would be re-explaining my review criteria in the chat on every session. The context window stays cleaner and the agent stays focused on what actually matters for the current task.

Agent Skills Standard · Claude Code Slash Commands · Copilot Extensions · Skills.sh
Deep dive →

AGENTS.md

08

The file that tells every agent how to behave here

AGENTS.md is a Markdown file you put at the root of a repository to tell AI coding agents how to behave in your project. What it is, how it is structured, what the agent should and should not touch. OpenAI adopted it for Codex, Anthropic uses it as CLAUDE.md for Claude Code, Google adopted it for Gemini CLI. Over 60,000 repos have one. The Linux Foundation now maintains the spec.
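A minimal file might look like this. The contents are illustrative, not a canonical template — the spec deliberately leaves the structure up to you:

```markdown
# AGENTS.md

## Project
TypeScript monorepo; packages live under `packages/`.

## Commands
- Build: `pnpm build`
- Test: `pnpm test` (run before committing)

## Conventions
- Reuse the existing error types in `packages/core`; do not invent new ones.

## Boundaries
- Never edit generated files under `dist/` or touch `.env`.
```

Because it is plain Markdown in the repo, it gets the same review, versioning, and sharing workflow as the rest of your code.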

Why it matters. Every project I work on has a CLAUDE.md. Without it, the agent guesses at conventions and it is wrong enough to be annoying. With it, I write the constraints once and every session benefits automatically. It is behavioral configuration that lives in the repo, not in some chat box no one else can see.

AGENTS.md Standard · CLAUDE.md · Cursor Rules · Copilot Instructions
Deep dive →

Multi-Agent Orchestration

09

What to reach for when one agent is not enough

Some tasks are too large to fit cleanly in one agent pass. Multi-agent orchestration splits the work: a planner that breaks down the goal, a researcher that gathers information, a coder that implements, a reviewer that checks the output. Each agent has its own narrow scope, tools, and instructions. The orchestrator coordinates them. The complexity is in the coordination layer, not in any individual agent.
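The coordination layer can be sketched as ordinary control flow. Each "agent" below is a stub function; a real system would swap each one for a separate model call with its own tools and instructions, but the orchestration shape stays the same:

```python
# A minimal sequential orchestrator. Each role has a narrow scope;
# the orchestrator owns the handoffs and the failure policy.
def researcher(goal):
    return f"notes on: {goal}"          # stub for a research agent

def coder(notes):
    return f"code based on ({notes})"   # stub for an implementation agent

def reviewer(code):
    return {"approved": "code" in code, "output": code}  # stub reviewer

def orchestrate(goal):
    notes = researcher(goal)     # gather information
    code = coder(notes)          # implement
    review = reviewer(code)      # check the output
    if not review["approved"]:
        raise RuntimeError("review failed; orchestrator decides what happens next")
    return review["output"]

result = orchestrate("add retry logic")
```

Production orchestrators add parallelism, retries, and shared state on top, which is where frameworks like LangGraph and CrewAI earn their keep.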

Why it matters. Context windows are finite. Specialization helps quality. When I run a research-heavy coding task, a single general-purpose agent gets sloppy across both. Separate agents for research and implementation each do their part better. The tradeoff is coordination overhead. Whether that trade is worth it depends on how long and how complex the task is.

OpenAI Agents SDK · LangGraph · CrewAI (44k+ stars) · Cloudflare Workflows
Deep dive →

Routing and Intent Detection

10

Match the task to the model, not the other way around

Not every request deserves the same model or the same agent. Routing classifies incoming requests and sends each one to the right handler. Simple queries go to a fast cheap model. Complex ones go to the expensive powerful one. Domain-specific requests go to the specialized agent. The goal is matching cost and capability to actual difficulty, rather than treating every request identically.
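The router itself can be tiny. This sketch uses a keyword heuristic as the classifier; production routers typically use a small, cheap model for that step, but the dispatch shape is identical. The handler labels are placeholders, not real model names:

```python
# Classify, then dispatch to the cheapest handler that can do the job.
def classify(request: str) -> str:
    hard_signals = ("refactor", "architecture", "multi-step", "debug")
    return "complex" if any(s in request.lower() for s in hard_signals) else "simple"

HANDLERS = {
    "simple":  lambda r: f"[fast-cheap-model] {r}",
    "complex": lambda r: f"[slow-capable-model] {r}",
}

def route(request: str) -> str:
    return HANDLERS[classify(request)](request)
```

The escalation policy matters more than the classifier: misrouting a complex request to the cheap model is the expensive failure, so routers usually bias toward escalating when unsure.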

Why it matters. If you run everything through your most capable model, you are probably overpaying by a lot. RouteLLM showed cost reductions of 40-85% with minimal quality loss just by routing well. Most requests in a production system are simple. A small routing layer that identifies the complex ones and escalates them pays for itself very quickly.

NVIDIA LLM Router · Vercel AI SDK Middleware · Hierarchical Agent Routing
Deep dive →

Autonomous Loops

11

Let it cook

An autonomous loop wraps an agent in a persistent execution cycle. Give it a task list and exit conditions, then let it run for hours without waiting for human input at every step. Claude Code has /loop, Devin has long-running tasks, Cursor has Cloud Agents. The concept is the same across all of them. You need three things to do this safely in production: circuit breakers, rate limiting, and sandboxed execution. Without those, an unattended loop is a liability, not a feature.
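Two of those three safety controls fit in a short sketch. Here `run_step` stands in for one agent iteration (which should itself run in a sandbox); the function names and thresholds are illustrative:

```python
import time

# A safety wrapper for an unattended loop: exit conditions come from the
# task list draining, a circuit breaker trips on consecutive failures,
# and a minimum interval provides crude rate limiting.
def autonomous_loop(tasks, run_step, max_failures=3, min_interval=0.0):
    failures, results = 0, []
    while tasks:                          # exit condition: work is done
        time.sleep(min_interval)          # rate limiting
        task = tasks[0]
        try:
            results.append(run_step(task))
            tasks.pop(0)
            failures = 0
        except Exception:
            failures += 1
            if failures >= max_failures:  # circuit breaker: stop, don't thrash
                raise RuntimeError(f"circuit open after {failures} failures on {task!r}")
    return results

done = autonomous_loop(["lint", "test"], run_step=lambda t: f"{t}: ok")
```

The circuit breaker is the piece people skip: without it, a stuck step retries ten thousand times instead of ten.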

Why it matters. This is the pattern that moves an agent from assistant to worker. Without a loop, every conversation resets. The agent cannot build on its previous work. With a well-designed loop, I can hand off a multi-hour task, go do something else, and come back to results. That is a qualitatively different relationship with the tool.

Devin 2.0 · Cursor Cloud Agents · OpenAI Codex · Claude Code /loop
Deep dive →
03

Infrastructure

The plumbing that holds it all together

Infrastructure is invisible when it works and catastrophic when it does not. Most early-stage agent projects skip this layer and pay for it in production outages and debug sessions that take days.

Memory Patterns

12

Giving agents a past and a future

By default, agents start each session with no idea what happened last time. Memory patterns are how you fix that. They give the agent a way to store and retrieve past interactions, learned preferences, and accumulated facts. The tricky part is not storing things. It is deciding what to keep, what to discard, and whether you can actually retrieve the right thing when you need it.
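The store-versus-retrieve asymmetry is easy to see in a toy sketch. This uses naive keyword overlap for scoring; real systems (Mem0, Letta, Zep) use embeddings and recency weighting, but the interface — remember facts, recall only the relevant ones into the prompt — is the same:

```python
# A toy memory store. Storage is trivial; retrieval quality is the
# entire problem, and keyword overlap is the crudest possible scorer.
class Memory:
    def __init__(self):
        self.facts = []

    def remember(self, fact: str):
        self.facts.append(fact)

    def recall(self, query: str, k: int = 2):
        words = set(query.lower().split())
        scored = sorted(
            self.facts,
            key=lambda f: len(words & set(f.lower().split())),
            reverse=True,
        )
        return scored[:k]

m = Memory()
m.remember("user prefers TypeScript over Python")
m.remember("deploys happen on Fridays")
m.remember("user timezone is UTC+2")
relevant = m.recall("user prefers which language")
# Only `relevant` goes into the next prompt — never the whole store.
```

The deciding-what-to-keep half of the problem lives in `remember`: production systems summarize, deduplicate, and expire facts rather than appending forever.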

Why it matters. Without memory, every conversation is a first conversation. That is fine for simple tasks and completely wrong for anything spanning multiple sessions. Memory is also where most teams underestimate the engineering work. Storing is easy. Retrieval that actually works is not.

Mem0 (91% lower latency) · Letta (formerly MemGPT) · Zep · Cloudflare Durable Objects
Deep dive →

Context Management

13

Fitting the world into a window

Context windows are finite and agentic tasks burn through them fast. Every tool result, every retrieved document, every piece of intermediate reasoning takes up space. Context management is the practice of deciding what the agent actually needs to see right now, and getting rid of everything else. It sounds simple. In a multi-step agent running dozens of tool calls, it is one of the most consequential decisions you will make.
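One common tactic is compaction: keep the system prompt and recent turns verbatim, and truncate bulky older tool results. A sketch — the message format mimics common chat APIs, and the thresholds are arbitrary illustration:

```python
# Compact a message history: system prompt and the most recent turns
# survive untouched; older oversized tool results get truncated.
def compact(messages, keep_recent=2, tool_limit=40):
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    compacted = []
    for m in head:
        if m["role"] == "tool" and len(m["content"]) > tool_limit:
            m = {"role": "tool", "content": m["content"][:tool_limit] + " …[truncated]"}
        compacted.append(m)
    return compacted + tail

history = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "tool", "content": "x" * 500},   # stale 500-char tool dump
    {"role": "user", "content": "Now summarize."},
    {"role": "assistant", "content": "Working on it."},
]
slim = compact(history)
```

Truncation is the bluntest instrument; summarizing old turns with a cheap model call preserves more signal at more cost. Either way, the decision of what the model sees is yours, not the framework's.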

Why it matters. Context management is not a performance optimization. It is a correctness problem. A context full of stale or irrelevant information changes what the model says. I have watched agents give completely different answers to the same question based purely on what else happened to be in the context at that moment.

Context Compaction · Prompt Caching · RAG pipelines · Hierarchical Memory Systems
Deep dive →

Structured Output

14

The part your downstream code actually reads

Agents do not just talk. They call functions, fill schemas, and return JSON that other systems consume. Structured output is the pattern of making that reliable. When the model returns malformed JSON, everything downstream breaks. The two main approaches are constrained decoding, which makes invalid output impossible at the token level, and validation with retry, which catches failures and tries again with the error attached.
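The validation-with-retry approach can be sketched with stdlib JSON. Libraries like Instructor do this with Pydantic models and typed schemas, but the loop shape — validate, attach the error, retry — is the same. `call_model` is a stub standing in for any LLM call:

```python
import json

# Validate-and-retry: parse the model's output, and on failure re-prompt
# with the error attached so the model can self-correct.
def get_structured(call_model, prompt, required_keys, max_retries=2):
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            data = json.loads(raw)
            missing = [k for k in required_keys if k not in data]
            if not missing:
                return data
            error = f"missing keys: {missing}"
        except json.JSONDecodeError as e:
            error = str(e)
        attempt_prompt = f"{prompt}\nYour last output was invalid: {error}"
    raise ValueError("could not obtain valid structured output")

# Stub model: fails once with malformed JSON, then succeeds.
replies = iter(['{"name": "Ada"', '{"name": "Ada", "age": 36}'])
person = get_structured(lambda p: next(replies), "Extract the person.", ["name", "age"])
```

Constrained decoding (Outlines and similar) removes the retry entirely by making invalid tokens impossible, at the cost of needing control over the decoding loop.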

Why it matters. A tool call that returns malformed JSON two percent of the time is not a minor inconvenience. It is a production incident that happens every fifty tool calls. At scale, that failure rate is unacceptable. Structured output reliability is what separates a demo from a system you can actually trust.

Instructor (3M+ monthly downloads) · PydanticAI · Outlines
Deep dive →

Guardrails and Safety

15

What happens when you give an agent keys and no limits

The more autonomy you give an agent, the more damage it can do when something goes wrong. Guardrails are the controls that prevent that. They validate inputs and outputs, enforce cost limits, scope what the agent is allowed to touch, and detect prompt injection from untrusted content in the environment. The failure modes they address are varied and some of them are genuinely nasty.
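Two of the cheapest guardrails — a spend ceiling and a tool allow-list — fit in a short sketch. Real systems layer many more checks (injection detection, output validation) on top, but the enforcement point is the same: every action passes through one gate. The class and costs here are invented for illustration:

```python
# A single enforcement point for scope and cost limits.
class GuardedExecutor:
    def __init__(self, tools, allowed, budget_usd):
        self.tools = tools
        self.allowed = set(allowed)
        self.budget = budget_usd
        self.spent = 0.0

    def call(self, name, cost_usd, *args):
        if name not in self.allowed:
            raise PermissionError(f"tool {name!r} is out of scope")
        if self.spent + cost_usd > self.budget:
            raise RuntimeError("cost limit exceeded")
        self.spent += cost_usd
        return self.tools[name](*args)

ex = GuardedExecutor(
    tools={"read": lambda p: f"contents of {p}", "delete": lambda p: None},
    allowed=["read"],   # read-only scope: delete exists but is unreachable
    budget_usd=1.00,
)
```

The detail that matters: the agent never sees the gate. Guardrails enforced in your code cannot be talked out of the way by a prompt injection, unlike instructions in the system prompt.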

Why it matters. Skipping guardrails is a decision that looks fine until it is not. The interesting failure modes do not happen in demos. They happen when the agent encounters an adversarial document in a production environment, or when a loop runs ten thousand times instead of ten, or when an agent with read access quietly attempts a write.

Guardrails AI · Lakera Guard · NVIDIA NeMo Guardrails · OpenAI Agents SDK Guardrails
Deep dive →

Sandboxes

16

Letting agents act without breaking things

If an agent can write and execute code on your machine, it can do essentially anything your machine can do. Sandboxes solve this by giving the agent a fully isolated environment: its own filesystem, its own network, its own process space. It can run arbitrary code, browse the web, modify files, and crash itself repeatedly. Your production systems stay untouched. A MicroVM-based sandbox spins up in around 150 milliseconds and disappears when the task finishes.

Why it matters. The threat model for an agent running in a MicroVM that spins up and disappears is genuinely different from one running directly on your infrastructure. That is not an abstract point. It is the difference between a crash that ends a session and a crash that corrupts a production database.

E2B (150ms startup) · Daytona (sub-90ms cold starts) · Browserbase (50M+ sessions) · Cloudflare Workers
Deep dive →

Agent Harnesses

17

The scaffolding between a model call and a working product

A harness is the scaffolding that sits between a raw model call and a working agent in production. It handles the tool-call loop, retry logic, timeout handling, state checkpointing, and crash recovery. Without it, a 2% per-call failure rate compounds across 50 calls into roughly a 64% chance that at least one call fails. With it, you have a system that can survive network errors, resume interrupted runs, and actually be trusted in production.
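Two of those responsibilities — retries and checkpointing — can be sketched together. `execute` stands in for one agent step and `checkpoint` for any persistent store; the names are illustrative, not any SDK's API:

```python
# A minimal harness: retry each step, checkpoint after each success,
# and skip already-completed steps so an interrupted run resumes.
def run_with_harness(steps, execute, checkpoint, max_retries=3):
    done = dict(checkpoint)                # resume from prior state
    for step in steps:
        if step in done:
            continue                       # completed in a previous run
        for attempt in range(max_retries):
            try:
                done[step] = execute(step)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise                  # real code: exponential backoff first
        checkpoint.update(done)            # persist progress after each step
    return done

# Flaky executor: fails the first time it sees each step, then succeeds.
seen = set()
def flaky(step):
    if step not in seen:
        seen.add(step)
        raise ConnectionError("transient")
    return f"{step} done"

state = {}
result = run_with_harness(["fetch", "build"], flaky, state)
```

The retry loop masks the transient failures entirely; the checkpoint dict is what lets a crashed run pick up where it left off instead of starting over.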

Why it matters. I have spent more time debugging harness issues than model issues. The model mostly does what it is supposed to. The harness is where tasks get dropped, state gets lost, and timeouts cause silent failures. The model is one component. The harness is what determines whether the whole thing works reliably.

Anthropic Long-Running Agent Harness · OpenAI Agents SDK Runner · LangGraph Checkpointer · Temporal for AI Agents
Deep dive →
04

Surfaces

Where agents meet humans

Surfaces are the least-discussed part of agent design and the part users actually experience. The underlying patterns can be excellent and the product can still feel wrong if the surface is poorly considered.

Generative UI

18

The agent decides what to show, not just what to say

Most agent UIs pre-build every possible component and then try to figure out which one to show. Generative UI flips that. The model decides what to render based on what the data actually calls for. A table, a form, a chart, a custom component. The AG-UI protocol from CopilotKit gives this a concrete event-stream model you can actually build on top of.

Why it matters. The difference between an agent that feels like a product and one that feels like a chat window bolted onto your app is usually this pattern. Static components mapped to static outputs look like a demo. Generative UI looks like software.

CopilotKit (10%+ Fortune 500) · v0 by Vercel · Vercel AI SDK (20M+ monthly downloads) · assistant-ui
Deep dive →

IDE-Embedded Agents

19

Context the model already has, where you are already working

Cursor, Windsurf, Copilot, Cline, and Zed all have agents that live inside the editor with access to your open files, language server data, git history, and terminal output. They already know your project. The range runs from tab completion all the way to autonomous agent mode. This is where most developers get their first real exposure to agents doing actual work.

Why it matters. You spend most of your working hours in an editor. The agent is already there with context it does not need to be told. That is a fundamentally different starting point than a chat window where you have to describe everything from scratch.

Cursor · GitHub Copilot · Cline (VS Code, open-source) · Zed · Windsurf
Deep dive →

TUI and CLI Agents

20

The fastest interface is the one you are already in

Claude Code, Aider, OpenCode, Gemini CLI, and Goose live in the terminal and work directly with the shell, filesystem, and dev tools. No GUI. No browser tab. They compose with pipes and scripts and can run headlessly in CI without any modification. This is where the power-user workflow lives, and also where automation starts.

Why it matters. CLI agents are the fastest interface to invoke and the easiest to automate. No browser tab needs to be open. No GUI needs to render. They are the same agent in interactive mode and in CI, which matters more than it sounds.

Claude Code · Aider (39k+ stars) · OpenCode (75+ LLM providers) · Gemini CLI (1,000 req/day free) · Goose (Block)
Deep dive →

Chat Interfaces

21

Looks simple. Nothing about it is simple.

ChatGPT, Claude.ai, Gemini, and DeepSeek are where most people first encounter what agents can do. The format looks simple because the interface is simple. The engineering underneath is not: streaming responses, progressive rendering, multi-turn context management, rich output formats. Chat is the lowest-friction entry point and also the most constrained one.

Why it matters. Chat sets the mental model most people carry into every other agent surface. Whatever they believe agents can and cannot do is shaped here first. That is a lot of weight for a text box to carry.

ChatGPT (GPT-5, canvas, plugins) · Claude.ai (200K context, Projects) · Gemini (1M token context) · Perplexity (citations, real-time search)
Deep dive →

Multi-Agent Workspaces

22

A control room for your agents

Multiple agents working on a shared project is powerful and, without a workspace, almost impossible to follow. Multi-agent workspaces give each agent persistent identity, shared state, and a visible interface so the human in the loop can actually see what is happening. VS Code called itself a multi-agent platform in January 2026. That framing was not accidental.

Why it matters. Multi-agent orchestration is invisible by default. You get outputs but not reasoning. Workspaces make the operation legible. You can see what each agent is doing, catch errors before they compound, and step in without stopping everything.

VS Code (v1.109 multi-agent platform) · Claude Code Agent Teams · Dust · CrewAI Cloud
Deep dive →

Headless and CI Agents

23

Scales with commits, not headcount

Agents running in CI pipelines, cron jobs, and background processes with no human watching. Code review bots, automated refactoring, dependency updates, test generation. They run on every PR, every commit, around the clock. This is where agents stop being tools and start being infrastructure, and the operational requirements change to match.

Why it matters. Headless agents scale with your commit volume, not your headcount. A team of three can have the review coverage of a team of ten if the agent is reliable. That changes what small teams can actually ship.

Claude Code -p flag (headless/CI) · Codex on GitHub · GitHub Copilot Autofix · GitLab Duo
Deep dive →
05

Design

Making agents usable, testable, and trustworthy

These are the unglamorous patterns that determine whether anyone can actually trust what the agent produces. They rarely make it into conference talks. They are almost always the reason production agents succeed or fail.

Human-in-the-Loop

24

Full autonomy sounds great until something gets deleted

I have run agents that behaved perfectly in staging and did something alarming in production. Human-in-the-loop patterns are checkpoints where the agent stops and waits, rather than deciding unilaterally. The question is not whether to have checkpoints, it is where to put them. That depends on how reversible the action is, and how much you have actually tested the agent, not how confident you feel about it.
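The checkpoint itself is a small piece of code; the hard part is deciding which actions land in the irreversible set. A sketch — `ask_human` stands in for a real approval channel (Slack, CLI prompt, email), and the action names are invented:

```python
# An approval gate keyed on reversibility: irreversible actions block
# until a human approves; everything else runs straight through.
IRREVERSIBLE = {"delete_branch", "send_email", "deploy_production"}

def execute_with_gate(action, args, run, ask_human):
    if action in IRREVERSIBLE:
        if not ask_human(f"Agent wants to run {action}({args}). Approve?"):
            return {"status": "blocked", "action": action}
    return {"status": "done", "result": run(action, args)}

outcome = execute_with_gate(
    "delete_branch", "old-feature",
    run=lambda a, x: f"{a}({x}) executed",
    ask_human=lambda prompt: False,   # reviewer says no
)
```

Loosening the gate over time means moving actions out of `IRREVERSIBLE` as the agent earns a track record — a code change you can review, not a vibe.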

Why it matters. Irreversibility is the thing that bites you. Drafting a document wrong is annoying. Deleting a branch, sending a mass email, pushing to production without review, those are a different category. Tight approval gates at the start are not a sign of distrust. They are how you build the evidence to loosen them later.

LangGraph interrupt() · HumanLayer (Slack/email/Discord) · CopilotKit pause execution · Rent a Human (MCP)
Deep dive →

Observability and Tracing

25

A system you cannot observe is a system you cannot fix

An agent ran 14 steps, returned the wrong answer, and I had no idea where it went wrong. That is the problem observability solves. Traces capture every tool call, reasoning step, and decision point so you can replay exactly what happened. Without that, debugging is just reading the final output and guessing at the middle.
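The core data structure is the span: name, input, output, timing, error. A stdlib sketch of a tracer that wraps tool functions — real systems (Langfuse, Arize Phoenix) emit equivalent spans over OpenTelemetry, and the decorator pattern here is illustrative, not any vendor's API:

```python
import time
import uuid

# Every wrapped call becomes a span you can replay later.
class Tracer:
    def __init__(self):
        self.spans = []

    def traced(self, fn):
        def wrapper(*args, **kwargs):
            span = {"id": str(uuid.uuid4()), "name": fn.__name__,
                    "input": args, "start": time.time()}
            try:
                span["output"] = fn(*args, **kwargs)
                return span["output"]
            except Exception as e:
                span["error"] = repr(e)   # failures are traced, not swallowed
                raise
            finally:
                span["end"] = time.time()
                self.spans.append(span)
        return wrapper

tracer = Tracer()

@tracer.traced
def search(query):
    return f"results for {query}"

search("agent patterns")
# tracer.spans now holds the full record of what the agent actually did.
```

With spans in hand, "step 9 of 14 returned stale data" replaces "it gave the wrong answer somehow."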

Why it matters. Agents are non-deterministic and multi-step. The same input can produce different paths on different runs. You cannot reproduce failures from the output alone. If you cannot trace what the agent actually did, you are not debugging, you are guessing at a system you do not understand.

Langfuse (open-source, self-hostable) · Braintrust (1M spans/month free) · Arize Phoenix (OpenTelemetry) · Cloudflare AI Gateway
Deep dive →

Evaluation and Testing

26

How do you know your agent is good?

I upgraded a model once and a capability I relied on quietly got worse. I did not find out for a week. Evals are what prevent that. Agent evals are harder than benchmark accuracy because you need to test tool selection, step ordering, failure recovery, and end-to-end outcomes across multi-step tasks. A simple right or wrong score does not capture any of that.
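A sketch of what "more than right-or-wrong" looks like in code: each eval case checks the final answer and the tool trajectory. The stub agent and case format are invented for illustration; frameworks like Inspect AI provide richer versions of the same idea:

```python
# Score an agent on outcome correctness AND tool-use trajectory.
def evaluate(agent, cases):
    scores = []
    for case in cases:
        answer, tool_calls = agent(case["task"])
        scores.append({
            "task": case["task"],
            "correct": case["check"](answer),
            "used_expected_tools":
                tool_calls[:len(case["expected_tools"])] == case["expected_tools"],
        })
    return scores

# Stub agent that reports its tool sequence alongside the answer.
def toy_agent(task):
    return "3.8M", ["search", "extract"]

results = evaluate(toy_agent, [{
    "task": "Population of Berlin?",
    "check": lambda a: "3.8" in a,
    "expected_tools": ["search", "extract"],
}])
```

Run a suite like this on every model upgrade or prompt change, and the silent regression becomes a red row in a report instead of a week-old production mystery.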

Why it matters. Silent regressions are worse than loud failures. A crash surfaces immediately. An agent that produces subtly wrong results with full confidence can run for weeks before someone catches it. Evals make the regression visible at the moment you introduce the change, not after it reaches users.

SWE-bench Verified (500 validated tasks) · LM Arena (crowdsourced evals) · Terminal-Bench (CLI agents) · Inspect AI (UK AI Safety Institute)
Deep dive →

The pattern taxonomy is inspired by agent-experience.dev by Brandon. The framing, prose, and opinions here are my own — written from building agents, not from reading about them.

26 patterns across 5 categories