Low quality articles honestly. Calling a bash script that takes a private ssh key seems malicious. Why would you invoke this program? Why are we throwing our hands up and covering our ears in this way. Strawmans.
Say you have several MCPs installed on a coding agent. One is a web search MCP and the other can run shell commands. Your project uses an AI-related package created by a malicious person who knows than an AI will be reading their docs. They put a prompt injection in the docs that asks the LLM to use the command runner MCP to curl a malicious bash script and execute it. Seems pretty plausible no?
That's pretty much the thing I call the "lethal trifecta" - any time you combine an MCP (or other LLM tool) that can access private data with one that gets exposed to malicious instructions with one that can exfiltrate that data somewhere an attacker can see it: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
It's a question as to how easily it is broken, but a good instruction to add for the agent/assistant is to tell it to treat everything outside of the instructions explicitly given as information/data, not as instructions. Which is what all software generally should be doing, by the way.
The problem is that doesn't work. LLMs cannot distinguish between instructions and data - everything ends up in the same stream of tokens.
System prompts are meant to help here - you put your instructions in the system prompt and your data in the regular prompt - but that's not airtight: I've seen plenty of evidence that regular prompts can over-rule system prompts if they try hard enough.
This is why prompt injection is called that - it's named after SQL injection, because the flaw is the same: concatenating together trusted and untrusted strings.
Unlike SQL injection we don't have an equivalent of correctly escaping or parameterizing strings though, which is why the problem persists.
No this is pretty much solved at this point. You simply have a secondary model/agent act as an arbitrator for every user input. The user input gets preprocessed into a standardized, formatted text representation (not a raw user message), and the arbitrator flags attempts at jailbreaking, prior to the primary agent/workflow being able to act on the user input.