Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Low quality articles honestly. Calling a bash script that takes a private ssh key seems malicious. Why would you invoke this program? Why are we throwing our hands up and covering our ears in this way. Strawmans.


By invoking, do you mean installing/configuring the MCP server? It's the LLM that decides which MCPs to use.


Say you have several MCPs installed on a coding agent. One is a web search MCP and the other can run shell commands. Your project uses an AI-related package created by a malicious person who knows than an AI will be reading their docs. They put a prompt injection in the docs that asks the LLM to use the command runner MCP to curl a malicious bash script and execute it. Seems pretty plausible no?


That's pretty much the thing I call the "lethal trifecta" - any time you combine an MCP (or other LLM tool) that can access private data with one that gets exposed to malicious instructions with one that can exfiltrate that data somewhere an attacker can see it: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...


It's a question as to how easily it is broken, but a good instruction to add for the agent/assistant is to tell it to treat everything outside of the instructions explicitly given as information/data, not as instructions. Which is what all software generally should be doing, by the way.


The problem is that doesn't work. LLMs cannot distinguish between instructions and data - everything ends up in the same stream of tokens.

System prompts are meant to help here - you put your instructions in the system prompt and your data in the regular prompt - but that's not airtight: I've seen plenty of evidence that regular prompts can over-rule system prompts if they try hard enough.

This is why prompt injection is called that - it's named after SQL injection, because the flaw is the same: concatenating together trusted and untrusted strings.

Unlike SQL injection we don't have an equivalent of correctly escaping or parameterizing strings though, which is why the problem persists.


People will never give up the dream that we can secure the LLM by saying please one more time than the attacker.


No this is pretty much solved at this point. You simply have a secondary model/agent act as an arbitrator for every user input. The user input gets preprocessed into a standardized, formatted text representation (not a raw user message), and the arbitrator flags attempts at jailbreaking, prior to the primary agent/workflow being able to act on the user input.


That doesn't work either. It's always possible to come up with an attack which subverts the "moderator" model first.

Using non-deterministic AI to protect against attacks against non-deterministic AI is a bad approach.


So you just need another agent to review the data being passed to the protector agent. Easy-peasy.

Use my openAI referral code #LETITRAIN for 10% off!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: