Guardrails and Safety

What happens when you give an agent keys and no limits

The more autonomy you give an agent, the more damage it can do when something goes wrong. Guardrails are the controls that prevent that. They validate inputs and outputs, enforce cost limits, scope what the agent is allowed to touch, and detect prompt injection from untrusted content in the environment. The failure modes they address are varied and some of them are genuinely nasty.

Why it matters

Skipping guardrails is a decision that looks fine until it is not. The interesting failure modes do not happen in demos. They happen when the agent encounters an adversarial document in a production environment, or when a loop runs ten thousand times instead of ten, or when an agent with read access quietly attempts a write.

Deep Dive

Guardrails address a set of failure modes that only become visible once you give an agent real autonomy. Prompt injection is the one that surprises people most: a document the agent reads from the web or a file system contains instructions designed to redirect the agent, and a naive agent follows them. Runaway costs are less dramatic but more common: a loop expected to run ten times runs ten thousand because no one set a limit. Permission scope violations happen when an agent with read access finds a way to write. Confidently wrong outputs that pass format validation but are semantically wrong are the hardest to catch. Each of these requires a different type of control.

NVIDIA's NeMo Guardrails, presented at EMNLP 2023, gave teams a programmable rails system with a domain-specific language for defining topical constraints, safety filters, and fact-checking requirements. Guardrails AI takes a validation-first approach: every model output passes through a suite of validators before use, with automatic repair attempts for violations that are fixable. Lakera Guard focuses specifically on prompt injection detection, which is the attack vector that concerns most teams once their agents start processing untrusted content from external sources.

Prompt injection defense is the area where the research is still catching up to the threat. Any content the agent reads from the environment could contain instructions designed to hijack its behavior. Anthropic's research on this identifies three lines of defense: skepticism heuristics built into the model itself, output validation that flags unexpected behavior changes, and privilege separation where the agent assigns different trust levels to content it generates versus content it retrieves from external sources. Most production teams end up layering multiple approaches because no single control is sufficient on its own.

In the Wild

Guardrails AI

Lakera Guard

NVIDIA NeMo Guardrails

OpenAI Agents SDK Guardrails

Go Deeper

PAPERNeMo Guardrails: A Toolkit for Controllable and Safe LLM ApplicationsNVIDIA / arXiv · 2023 ARTICLEPrompt Injection DefensesAnthropic Research GUIDEBuilding Effective AgentsAnthropic · 2024 DOCSGuardrails AI DocumentationGuardrails AI DOCSNVIDIA NeMo Guardrails DocumentationNVIDIA

Related Patterns

Sandboxes Human-in-the-Loop Evaluation and Testing

All 26 patternsRead the blogHome