What does proof of safety look like in practice? Could you give some examples?

v_CodeSentinal · 2026-03-03T03:14:18 1772507658

Nothing super fancy. For me “proof” just means the agent has to make its intent explicit in a way I can check before running it.

For example: 1) If it wants to delete a file, it has to output the exact path it thinks it’s deleting. I normalize it and make sure it’s inside the project root. If not, I block it. 2) If it proposes a big change, I require a diff first instead of letting it execute directly. 3) After code changes, I run tests or at least a lint/type check before accepting it.

So it’s less about formal proofs and more about forcing the agent to surface assumptions in a structured way, then verifying those assumptions mechanically.

Still hacky, but it reduced the “creative workaround” behavior a lot.

schipperai · 2026-03-03T09:55:07 1772531707

Is this a policy snippet you add to your CLAUDE.md? Do you still maintain a deny list?

I recently added a snippet asking Claude to not try to bypass the deny list. I didn't have an incidence since but Im still nervous... Claude once bypassed the deny list and nuked an important untracked directory which caused me lots of trouble.