On October 6, 2025, OpenAI released Guardrails, a new safety framework designed to detect and prevent harmful behaviors in AI systems by leveraging large language models (LLMs) to judge inputs and outputs for risks like jailbreaks, prompt injections, and more. While the framework represents a step forward in modular AI safety, recent research from cybersecurity firm HiddenLayer has revealed significant vulnerabilities showing how basic prompt injection techniques can completely bypass these safeguards.
What is OpenAI’s Guardrails Framework?
Guardrails aims to empower developers with customizable pipelines that filter malicious or harmful interactions in AI agents. Key capabilities include:
- Masking personally identifiable information (PII)
- Content moderation
- LLM-based checks for complex threats like jailbreak attempts, off-topic prompts, hallucinations, and agentic prompt injections (where tools’ outputs misalign with user intent)
- Non-LLM tools for filtering URLs and detecting PII
At its core, the framework uses an LLM “judge” to evaluate the safety of outputs generated by the same or similar LLMs powering the AI agent.

“Same Model, Different Hat”
HiddenLayer’s research highlights a fundamental flaw in this design. When the same or a similar LLM is used both to generate content and to judge its safety, attackers can exploit prompt injection vulnerabilities to manipulate both simultaneously. This “same model, different hat” problem means that if the base model is tricked into generating malicious content, the judge model can also be tricked into approving it essentially nullifying the safety layer.
This stands in contrast to traditional non-LLM detectors, which avoid such weaknesses but lack the flexibility to handle nuanced threats like obfuscated jailbreak prompts.
How Attackers Bypass Guardrails
HiddenLayer demonstrated multiple real-world bypasses, including:
- Jailbreak Prompt Manipulation: Using OpenAI’s default
gpt-4.1-mini
with a 0.7 confidence threshold, the system correctly flagged a malicious prompt requesting harmful instructions (e.g., anthrax sporulation) with 95% confidence. However, attackers crafted injection templates that spoofed the judge’s metadata, reporting a false lower confidence score (e.g., 0.675) below the threshold (e.g., 0.6), allowing the dangerous content to slip through unnoticed. - Agentic Prompt Injection Attacks: The system’s mechanism to block tool misuse, such as unauthorized “fetch” calls that leak sensitive data, can also be circumvented. By embedding fabricated judge override data within web content, attackers convinced the judge LLM to approve malicious tool calls, enabling indirect data exfiltration and command injection.
These exploits underscore the fragility of relying on LLM-based self-judgment for safety-critical decisions.
Why This Matters
As AI systems increasingly integrate into sensitive enterprise workflows, the stakes for robust safety measures have never been higher. Guardrails’ modular design and use of LLM judges are promising — but as HiddenLayer’s findings show, over-reliance on the same model family for both generation and evaluation invites sophisticated adversarial tactics that can evade detection.
Moreover, this research builds on earlier work like HiddenLayer’s Policy Puppetry (April 2025), which demonstrated universal prompt injection bypasses across major models.
Recommendations for AI Safety
To mitigate risks highlighted by this research, organizations and AI developers should consider:
- Independent validation layers outside the generating LLM family
- Red teaming and adversarial testing focused on prompt injection and judge manipulation
- External monitoring and anomaly detection for AI outputs and tool interactions
- Careful evaluation of confidence thresholds and metadata integrity
- Avoiding sole reliance on self-judgment mechanisms
OpenAI’s Guardrails framework marks meaningful progress in modular AI safety but to avoid false security, it must evolve beyond vulnerable self-policing and incorporate diverse, independent safeguards.