
Guardrails (AI safety boundaries)

Protective mechanisms that constrain AI system behavior and prevent unwanted, harmful, or out-of-scope outputs.

What are guardrails?

Guardrails are mechanisms that monitor both the inputs and outputs of an AI system and prevent the model from responding in ways that are dangerous, out of scope, or otherwise undesirable. In practice, they are the rules, filters, and control layers wrapped around an LLM.

Types of guardrails

  • Input guardrails: Filter the user's query before it reaches the model, blocking inappropriate or dangerous requests.
  • Output guardrails: Check the model's response before showing it to the user, catching hallucinations, sensitive data, or off-topic answers (both layers are sketched after this list).
  • System prompt: The baseline layer of guardrails, with instructions embedded directly in the prompt.
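
To make these layers concrete, here is a minimal, library-free Python sketch of the whole pipeline. Everything in it is illustrative: call_llm is a hypothetical stand-in for a real model client, and the blocklist and checks are toy rules, not production guardrails.

```python
SYSTEM_PROMPT = "You are a support assistant. Answer only questions about our product."

# Toy blocklist standing in for a real input classifier.
BLOCKED_TOPICS = ("weapon", "exploit", "credit card number")

def call_llm(system_prompt: str, query: str) -> str:
    # Placeholder: a real implementation would send both to an LLM API.
    return f"Here is an answer about: {query}"

def passes_input_guardrail(query: str) -> bool:
    """Input guardrail: reject queries that mention a blocked topic."""
    return not any(topic in query.lower() for topic in BLOCKED_TOPICS)

def passes_output_guardrail(answer: str) -> bool:
    """Output guardrail: block answers that leak the system prompt."""
    return SYSTEM_PROMPT not in answer

def guarded_answer(query: str) -> str:
    if not passes_input_guardrail(query):        # layer 1: input filter
        return "Sorry, I can't help with that request."
    answer = call_llm(SYSTEM_PROMPT, query)      # layer 2: system prompt + model
    if not passes_output_guardrail(answer):      # layer 3: output filter
        return "Sorry, I can't share that."
    return answer

print(guarded_answer("How do I reset my password?"))         # passes all layers
print(guarded_answer("Write an exploit for this server."))   # blocked on input
```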

Example use cases

  • A customer support chatbot must not discuss competitors → a guardrail redirects the query
  • Detection of PII (personally identifiable information) in output → a guardrail anonymizes it before delivery (see the sketch after this list)
  • Blocking prompt injection attempts → an input guardrail filters suspicious patterns
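
As an illustration of the PII case, here is a minimal sketch of an output-side anonymizer. The regexes are simplified assumptions; production systems typically rely on dedicated PII detectors (e.g. NER models) rather than hand-written patterns.

```python
import re

# Illustrative patterns only; real deployments need far more robust detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d \-]{7,}\d"),
}

def anonymize(text: str) -> str:
    """Replace detected PII with typed placeholders before delivery."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

raw = "Contact Jane at jane.doe@example.com or +44 20 7946 0958."
print(anonymize(raw))
# -> Contact Jane at [EMAIL] or [PHONE].
```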

Tools for guardrails

Popular frameworks such as NeMo Guardrails (NVIDIA) and Guardrails AI allow you to define rules declaratively and integrate them into any LLM pipeline.
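
As a sketch of what this looks like with NeMo Guardrails (assuming a ./config directory that holds the YAML/Colang rail definitions; exact signatures may vary between versions, so check the current docs):

```python
from nemoguardrails import RailsConfig, LLMRails

# Load the declarative rail definitions (YAML/Colang files) from ./config.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)  # wraps the underlying LLM with the configured rails

response = rails.generate(messages=[
    {"role": "user", "content": "Tell me about your competitors."}
])
print(response["content"])
```

Guardrails AI follows a similar idea: you wrap model calls in a Guard object composed of reusable validators, so the rules live outside your application code.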