Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

What does proof of safety look like in practice? Could you give some examples?
 help



Nothing super fancy. For me “proof” just means the agent has to make its intent explicit in a way I can check before running it.

For example: 1) If it wants to delete a file, it has to output the exact path it thinks it’s deleting. I normalize it and make sure it’s inside the project root. If not, I block it. 2) If it proposes a big change, I require a diff first instead of letting it execute directly. 3) After code changes, I run tests or at least a lint/type check before accepting it.

So it’s less about formal proofs and more about forcing the agent to surface assumptions in a structured way, then verifying those assumptions mechanically.

Still hacky, but it reduced the “creative workaround” behavior a lot.


Is this a policy snippet you add to your CLAUDE.md? Do you still maintain a deny list?

I recently added a snippet asking Claude to not try to bypass the deny list. I didn't have an incidence since but Im still nervous... Claude once bypassed the deny list and nuked an important untracked directory which caused me lots of trouble.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: