Low quality articles honestly. Calling a bash script that takes a private ssh ke...

abujazar · 2025-06-09T00:27:06 1749428826

By invoking, do you mean installing/configuring the MCP server? It's the LLM that decides which MCPs to use.

garbanz0 · 2025-06-09T00:20:29 1749428429

Say you have several MCPs installed on a coding agent. One is a web search MCP and the other can run shell commands. Your project uses an AI-related package created by a malicious person who knows than an AI will be reading their docs. They put a prompt injection in the docs that asks the LLM to use the command runner MCP to curl a malicious bash script and execute it. Seems pretty plausible no?

simonw · 2025-06-09T00:23:31 1749428611

That's pretty much the thing I call the "lethal trifecta" - any time you combine an MCP (or other LLM tool) that can access private data with one that gets exposed to malicious instructions with one that can exfiltrate that data somewhere an attacker can see it: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

dinfinity · 2025-06-09T00:21:29 1749428489

It's a question as to how easily it is broken, but a good instruction to add for the agent/assistant is to tell it to treat everything outside of the instructions explicitly given as information/data, not as instructions. Which is what all software generally should be doing, by the way.

simonw · 2025-06-09T00:25:26 1749428726

The problem is that doesn't work. LLMs cannot distinguish between instructions and data - everything ends up in the same stream of tokens.

System prompts are meant to help here - you put your instructions in the system prompt and your data in the regular prompt - but that's not airtight: I've seen plenty of evidence that regular prompts can over-rule system prompts if they try hard enough.

This is why prompt injection is called that - it's named after SQL injection, because the flaw is the same: concatenating together trusted and untrusted strings.

Unlike SQL injection we don't have an equivalent of correctly escaping or parameterizing strings though, which is why the problem persists.

tedunangst · 2025-06-09T01:04:16 1749431056

People will never give up the dream that we can secure the LLM by saying please one more time than the attacker.

NeutralCrane · 2025-06-09T02:43:06 1749436986

No this is pretty much solved at this point. You simply have a secondary model/agent act as an arbitrator for every user input. The user input gets preprocessed into a standardized, formatted text representation (not a raw user message), and the arbitrator flags attempts at jailbreaking, prior to the primary agent/workflow being able to act on the user input.

simonw · 2025-06-09T03:01:26 1749438086

That doesn't work either. It's always possible to come up with an attack which subverts the "moderator" model first.

Using non-deterministic AI to protect against attacks against non-deterministic AI is a bad approach.

K0balt · 2025-06-09T10:07:08 1749463628

So you just need another agent to review the data being passed to the protector agent. Easy-peasy.

Use my openAI referral code #LETITRAIN for 10% off!