Generative AI Security Guardrails

Prompt injection is the SQL injection of the AI era. And just like SQL injection was dismissed as an edge case by many teams in the early 2000s, prompt injection is being systematically underestimated today.

This is not a theoretical risk. We have audited production systems where a single crafted user message could exfiltrate the entire system prompt, bypass content filters, or cause the agent to perform unauthorised actions on behalf of the attacker.

The Threat Model

Before implementing guardrails, you need a clear threat model. The three primary adversarial objectives against a generative AI system are:

Extraction — retrieving information the model has access to but should not disclose. System prompts, retrieved documents, other users' conversation history.

Manipulation — changing the model's behaviour in ways the operator did not intend. Bypassing content policies, impersonating different personas, executing unauthorised tool calls.

Amplification — using the AI as a vector to attack downstream systems. Injecting malicious content into outputs that will be processed by other systems, databases, or users.

The Guardrail Stack

Input Validation

Every user input should pass through a validation layer before reaching the model:

Length limits — long inputs can exhaust context and crowd out the system prompt
Content classification — a fast, cheap classifier to flag adversarial patterns before they reach the expensive model
PII detection — prevent sensitive data from being sent to external APIs inadvertently

We use a small, fast classification model (typically a fine-tuned BERT variant) as a first-pass filter. This costs under 0.1ms and catches 80%+ of known injection patterns.

Prompt Architecture

The structure of your prompt is itself a security control:

System instructions should be separated from user content with clear delimiters that are hard to escape
Tool permissions should be specified explicitly, not implicitly
The prompt should state what the model should do when it detects adversarial input — most models will follow these instructions

Output Validation

Don't trust model output blindly:

Validate that structured outputs conform to expected schemas before consuming them
Redact PII patterns from outputs before displaying to users
Log all outputs for post-hoc audit — if something goes wrong, you need the receipts

Tool Call Sandboxing

If your agent executes tool calls, each tool should operate under a principle of least privilege:

Database queries should use read-only credentials unless write access is explicitly required for that tool
File system access should be restricted to a defined sandbox directory
API calls should be rate-limited and logged

The Red Team Test

Before launching any customer-facing AI system, run a structured red team exercise. Ask someone with adversarial intent — ideally an external security researcher — to spend 4 hours trying to break it.

The findings will surprise you. They always do.

Our team conducts AI security audits for enterprise deployments. Book a security review.