
OpenAI’s Guardrails Framework Bypassed by Basic Prompt Injection

On October 6, 2025, OpenAI released Guardrails, a new safety framework designed to detect and prevent harmful behaviors in AI systems by leveraging large language models (LLMs) to judge inputs and outputs for risks like jailbreaks, prompt injections, and more. While the framework represents a step forward in modular AI safety, recent research from cybersecurity firm HiddenLayer has revealed significant vulnerabilities showing how basic prompt injection techniques can completely bypass these safeguards.

What is OpenAI’s Guardrails Framework?

Guardrails aims to empower developers with customizable pipelines that filter malicious or harmful interactions in AI agents. Key capabilities include:

  • Masking personally identifiable information (PII)
  • Content moderation
  • LLM-based checks for complex threats like jailbreak attempts, off-topic prompts, hallucinations, and agentic prompt injections (where tools’ outputs misalign with user intent)
  • Non-LLM tools for filtering URLs and detecting PII

At its core, the framework uses an LLM “judge” to evaluate the safety of outputs generated by the same or similar LLMs powering the AI agent.

“Same Model, Different Hat”

HiddenLayer’s research highlights a fundamental flaw in this design. When the same or a similar LLM is used both to generate content and to judge its safety, attackers can exploit prompt injection vulnerabilities to manipulate both simultaneously. This “same model, different hat” problem means that if the base model is tricked into generating malicious content, the judge model can also be tricked into approving it, essentially nullifying the safety layer.

This stands in contrast to traditional non-LLM detectors, which avoid such weaknesses but lack the flexibility to handle nuanced threats like obfuscated jailbreak prompts.

How Attackers Bypass Guardrails

HiddenLayer demonstrated multiple real-world bypasses, including:

  • Jailbreak Prompt Manipulation: Using OpenAI’s default gpt-4.1-mini with a 0.7 confidence threshold, the system correctly flagged a malicious prompt requesting harmful instructions (e.g., anthrax sporulation) with 95% confidence. However, attackers crafted injection templates that spoofed the judge’s metadata so that it reported a lower confidence score (e.g., 0.675), which falls below the configured 0.7 threshold, alongside a fabricated threshold value (e.g., 0.6). The dangerous content then slipped through unflagged.
  • Agentic Prompt Injection Attacks: The system’s mechanism to block tool misuse, such as unauthorized “fetch” calls that leak sensitive data, can also be circumvented. By embedding fabricated judge override data within web content, attackers convinced the judge LLM to approve malicious tool calls, enabling indirect data exfiltration and command injection.

These exploits underscore the fragility of relying on LLM-based self-judgment for safety-critical decisions.
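
The following minimal sketch shows why a manipulated confidence score defeats a threshold check. The JSON field names and both functions are hypothetical illustrations, not the Guardrails schema or API; only the 0.7 threshold and the 0.95 and 0.675 scores come from the example above.

```python
import json

# Threshold configured by the developer (the 0.7 default cited above).
CONFIGURED_THRESHOLD = 0.7

# Fields this hypothetical pipeline expects the judge to return.
EXPECTED_FIELDS = {"flagged", "confidence"}


def naive_decision(judge_json: str) -> bool:
    """Block only if the judge's self-reported confidence meets the threshold."""
    verdict = json.loads(judge_json)
    return verdict["confidence"] >= CONFIGURED_THRESHOLD


def unexpected_fields(judge_json: str) -> set:
    """Return any fields outside the expected schema (a possible tamper signal)."""
    verdict = json.loads(judge_json)
    return set(verdict) - EXPECTED_FIELDS


honest = '{"flagged": true, "confidence": 0.95}'
spoofed = '{"flagged": true, "confidence": 0.675, "threshold": 0.6}'

print(naive_decision(honest))      # True: 0.95 >= 0.7, so the prompt is blocked
print(naive_decision(spoofed))     # False: 0.675 < 0.7, so the prompt passes
print(unexpected_fields(spoofed))  # {'threshold'}
```

The decision rests entirely on a number emitted by a model that has read attacker-controlled text. Checking for unexpected fields can surface crude tampering, but it does not make the score itself trustworthy.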

Why This Matters

As AI systems increasingly integrate into sensitive enterprise workflows, robust safety measures matter more than ever. Guardrails’ modular design and use of LLM judges are promising, but as HiddenLayer’s findings show, over-reliance on the same model family for both generation and evaluation invites adversarial tactics that can evade detection.

Moreover, this research builds on earlier work like HiddenLayer’s Policy Puppetry (April 2025), which demonstrated universal prompt injection bypasses across major models.

Recommendations for AI Safety

To mitigate risks highlighted by this research, organizations and AI developers should consider:

  • Independent validation layers outside the generating LLM family
  • Red teaming and adversarial testing focused on prompt injection and judge manipulation
  • External monitoring and anomaly detection for AI outputs and tool interactions
  • Careful evaluation of confidence thresholds and metadata integrity
  • Avoiding sole reliance on self-judgment mechanisms
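
As one illustration of an independent validation layer, a deterministic allowlist can gate an agent’s “fetch” tool calls without consulting any LLM. This is a sketch under stated assumptions: the host names and the function name are hypothetical, and a real deployment would need its own policy.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the hosts your agent genuinely needs.
ALLOWED_HOSTS = {"docs.example.com", "api.example.com"}


def is_fetch_allowed(url: str) -> bool:
    """Allow a fetch only for HTTPS URLs whose host is explicitly allowlisted."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = (parsed.hostname or "").lower()
    return host in ALLOWED_HOSTS


print(is_fetch_allowed("https://docs.example.com/guide"))            # True
print(is_fetch_allowed("https://attacker.example.net/?q=secret"))    # False
print(is_fetch_allowed("https://docs.example.com@evil.test/steal"))  # False
```

Because the check is plain code, text injected into web content cannot talk it into approving a call. The trade-off is the one noted earlier: it is rigid and cannot reason about nuanced intent.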

OpenAI’s Guardrails framework marks meaningful progress in modular AI safety, but to avoid a false sense of security it must evolve beyond vulnerable self-policing and incorporate diverse, independent safeguards.

Figure: Spoofed Homebrew install page (Source: The Sequence)

Sophisticated Homebrew Installer Spoofing Campaign Targets macOS Users

A new and highly polished campaign is targeting macOS users by cloning the Homebrew installation experience and quietly slipping malicious commands into victims’ clipboards. Instead of attacking Homebrew’s package repositories, attackers are impersonating the trusted installation page itself and hijacking the moment users paste the install command.

What’s happening

Researchers uncovered several pixel-perfect replicas of the official Homebrew installer page. Fraudulent domains identified include:

  • homebrewfaq[.]org
  • homebrewclubs[.]org
  • homebrewupdate[.]org

These sites look and behave like the genuine Homebrew install page, but they include hidden JavaScript that interferes with normal copy-and-paste behavior. Rather than allowing users to select the install command manually, the spoofed pages disable normal text selection and force visitors to click a site-provided Copy button. That button runs code which injects extra, malicious commands into the clipboard along with the legitimate Homebrew installer command.

How the attack works

  • The attacker creates a convincing replica of the Homebrew install page so users won’t suspect anything is wrong.
  • The page blocks standard selection and clipboard events (contextmenu, selectstart, copy, cut, dragstart), preventing manual copying of the installation text.
  • A visible Copy button triggers a copyInstallCommand() routine in JavaScript. That routine writes a command string to the clipboard using the Clipboard API or a textarea fallback for compatibility across browsers.
  • When the victim pastes that clipboard content into Terminal and runs it, the legitimate Homebrew install command executes, but it is accompanied by the attacker’s injected command(s), which download and run additional payloads in the background.
  • Because the real Homebrew installer runs normally, the infection can be stealthy and persistent while appearing innocuous to the user.
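
One simple defense against the flow above is to compare what was copied with the command you expect before running it. The sketch below assumes the install command published on brew.sh at the time of writing (verify it against the official site); the function name is hypothetical, and the injected text is a harmless placeholder, not the real payload.

```python
# Install command as published on brew.sh at the time of writing (assumption:
# confirm against the official site before relying on it).
EXPECTED = (
    '/bin/bash -c "$(curl -fsSL '
    'https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"'
)


def unexpected_clipboard_content(clipboard_text: str) -> str:
    """Return whatever the clipboard holds beyond the expected command."""
    text = clipboard_text.strip()
    if text == EXPECTED:
        return ""
    # Remove one copy of the expected command and report the remainder.
    return text.replace(EXPECTED, "", 1).strip()


clean = EXPECTED + "\n"
tampered = EXPECTED + " ; echo placeholder-for-injected-command"

print(repr(unexpected_clipboard_content(clean)))     # ''
print(repr(unexpected_clipboard_content(tampered)))  # '; echo placeholder-for-injected-command'
```

Pasting into a text editor first achieves the same thing by eye: anything before or after the expected command is a reason to stop.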

Security analysts also noted Russian-language comments in the code marking where malicious commands are inserted, a sign that this may be a commoditized service or a repeatable toolkit attackers can reuse.

Why this is notable

This campaign represents a significant shift in supply-chain style tactics. Instead of compromising package repositories or tampering with software packages directly, attackers have built a parallel interception point: the initial installation experience. That bypasses many defenses that focus on repository integrity and package signing, and it relies instead on social engineering and subtle client-side manipulation of the clipboard.

Homebrew itself has no recent compromise reports, but the attack exploits the strong user trust placed in Homebrew’s installation instructions.

For safety reasons I’ve redacted the exact malicious command observed in the wild. Publishing exact live payload commands or download URLs could enable abuse. If you need to analyze the specific artifacts for incident response, work with a trusted security team and obtain samples through secure channels.

Indicators and detection

Researchers identified the suspicious domains listed above and monitored infrastructure linked to known malware distribution networks. The telltale signs of this campaign include:

  • Pixel-perfect replicas of the Homebrew installer page hosted on non-official domains.
  • Disabled text selection and clipboard-related event handlers.
  • A required on-page Copy button (rather than allowing manual selection).
  • JavaScript routines that overwrite clipboard contents to append or prepend extra commands.
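
The indicators above can be turned into a rough static check of a page’s source. This is a heuristic sketch, not a detector used by the researchers: the function name and the regular expression are assumptions, and legitimate pages can also register clipboard handlers, so the findings are signals rather than proof.

```python
import re

# Events the spoofed pages were reported to block.
BLOCKED_EVENTS = ("contextmenu", "selectstart", "copy", "cut", "dragstart")

# Domains from the report, stored defanged and refanged for comparison.
SPOOFED_DOMAINS = {
    d.replace("[.]", ".")
    for d in ("homebrewfaq[.]org", "homebrewclubs[.]org", "homebrewupdate[.]org")
}


def page_indicators(host: str, source: str) -> list[str]:
    """List which of the reported indicators appear for a host and page source."""
    findings = []
    if host.lower() in SPOOFED_DOMAINS:
        findings.append("known spoofed domain")
    for event in BLOCKED_EVENTS:
        # Matches addEventListener('copy', ...) or addEventListener("copy", ...).
        pattern = r"addEventListener\(\s*['\"]" + event + r"['\"]"
        if re.search(pattern, source):
            findings.append(f"handler on '{event}' event")
    if "copyInstallCommand" in source:
        findings.append("copyInstallCommand routine present")
    return findings


sample = """
document.addEventListener('selectstart', e => e.preventDefault());
document.addEventListener("copy", e => e.preventDefault());
function copyInstallCommand() { /* ... */ }
"""

for finding in page_indicators("homebrewfaq.org", sample):
    print(finding)
```

Run against the sample, this reports the spoofed domain, the two blocked events, and the copy routine.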

Trinity of Chaos: The New Face of Ransomware and Data Extortion

The cybersecurity world has been rocked by the rise of the Trinity of Chaos, a highly sophisticated ransomware collective that has launched a new data leak site featuring sensitive information from 39 major corporations. This group, possibly a merger of notorious hacker groups like Lapsus$, Scattered Spider, and ShinyHunters, represents a significant evolution in the scale and complexity of cybercrime.

The Trinity of Chaos collective is not just another ransomware gang. It is a hybrid threat actor that combines traditional ransomware tactics with data extortion, which maximizes its operational impact and financial return and leaves organizations exposed to both financial losses and reputational damage.

Data Leak Sites on the TOR Network

The group’s primary method of operation revolves around their Data Leak Site, hosted on the TOR network. This is a familiar tactic among modern ransomware groups, and Trinity of Chaos has refined it to a level of operational sophistication that sets them apart.

Rather than announcing new attacks or publicizing their ransom demands upfront, the group opts to share samples of stolen data, including sensitive records, to prove the success of their breaches. This approach not only validates their claims but also increases the pressure on their victims by threatening public exposure. This calculated strategy ensures the group maintains operational security while leveraging the threat of reputational harm to manipulate their targets into compliance.

Previous Salesforce Exploit and Data-Exfiltration Tactics

Trinity of Chaos has already demonstrated the ability to breach Salesforce environments, a method the group refined by abusing compromised Salesloft Drift AI chat integrations. Using social engineering, the group obtains OAuth tokens, which it then uses to access corporate Salesforce environments. This targeted approach has proven highly effective, leading to substantial data breaches and stolen records.

The leaked data from these campaigns primarily includes personally identifiable information, but also reveals internal communications, loyalty program data, and full activity histories. In addition to using this data for extortion, Trinity of Chaos has proven adept at using it for further social engineering campaigns, gaining additional leverage over both companies and individuals.

This particular method of attack prompted the FBI to issue a flash warning, cautioning organizations to monitor their Salesforce instances for signs of intrusion.

Major Corporations Hit

The scale of the breach is unprecedented. Among the compromised organizations are some of the world’s most recognizable names, including:

  • Google
  • Cisco
  • Toyota Motor Corporation
  • FedEx
  • Disney/Hulu
  • Home Depot
  • Marriott
  • McDonald’s

These companies, spanning a range of industries including technology, automotive, finance, and telecommunications, now face the prospect of massive data leaks unless they enter negotiations with the attackers.

Pressure Tactics and Ultimatums

Trinity of Chaos has set October 10th as a hard deadline for negotiations. Like many traditional ransomware operations, the group employs psychological pressure tactics, leveraging the threat of public data exposure and even regulatory reporting that could lead to criminal negligence charges for non-compliant companies.

This combination of tactics heightens the stakes for organizations and forces them to make quick decisions under intense pressure.

A Treasure Trove for Cybercriminals

The Trinity of Chaos collective claims to have amassed 1.5 billion records from over 760 companies, including:

  • 254 million account records
  • 579 million contact entries
  • 458 million case files

This data, collected over several years, comes from previous attack campaigns such as UNC6395 and UNC6040, showcasing the group’s systematic approach to data aggregation and monetization.

By compiling vast databases of stolen records, Trinity of Chaos is building a cybercrime empire with an unprecedented level of access to sensitive corporate and personal information.

Sophistication and Operational Security

What sets Trinity of Chaos apart is their operational security. The group is known to maintain persistent access within victim networks for extended periods of time, often remaining undetected for years.

This long-term, stealthy approach is indicative of a highly disciplined and experienced group, with extensive operational infrastructure that allows them to scale and evolve their methods over time.

The Rise of a Hybrid Cybercrime Syndicate

The Trinity of Chaos collective marks a significant evolution in the world of cybercrime. By blending ransomware tactics with data extortion and leveraging the TOR network for secure communications and leak sites, they are raising the stakes for both organizations and the cybersecurity industry at large. With an impressive track record, a global reach, and an ever-growing arsenal of attack methods, this group represents a formidable challenge to the cybersecurity landscape.

Organizations are urged to stay vigilant, fortify their defenses, and remain proactive in addressing any potential threats to prevent becoming the next victim of this highly skilled and resourceful group.