While this can be done in principle (it's not foolproof enough to, for example, guarantee an LLM doesn't leak secrets), it is much harder to fool the supervisor than the generator because:
1. You can't get output from the supervisor, other than the binary enforcement action of shutting you down (it can't leak its instructions)
2. The supervisor can judge the conversation on the merits of the most recent turns, since it doesn't need to produce a response that respects the full history (you can't lead the supervisor step by step into the wilderness)
3. LLMs, like humans, are generally better at judging good output than generating good output
It would be interesting to see whether there is a layout of supervisors that makes this less prone to hijacking. Something like the Byzantine generals problem, where you know a few might get fooled, so you construct personalities that are more or less malleable and go for consensus.
This still wouldn't make it perfect, but it would make the system quite hard to study from an attacker's perspective.
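A minimal sketch of that consensus idea, with entirely hypothetical names and thresholds: each supervisor independently scores the transcript, and the system only shuts the generator down when a quorum votes to block, so an attacker has to fool a majority of differently-tuned judges rather than one.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Supervisor:
    # "Personality" reduced to a single knob: how suspicious a transcript
    # must look before this supervisor votes to block. A low threshold
    # stands in for a strict, less malleable judge.
    name: str
    threshold: float

    def votes_block(self, suspicion_score: float) -> bool:
        return suspicion_score >= self.threshold

def consensus_shutdown(panel: List[Supervisor], suspicion_score: float,
                       quorum: int) -> bool:
    """Shut down only if at least `quorum` supervisors vote to block,
    so fooling a single supervisor isn't enough to avoid (or trigger)
    the enforcement action."""
    block_votes = sum(s.votes_block(suspicion_score) for s in panel)
    return block_votes >= quorum

# Three supervisors of varying malleability; require 2-of-3 agreement.
panel = [Supervisor("strict", 0.3),
         Supervisor("moderate", 0.5),
         Supervisor("lenient", 0.8)]
print(consensus_shutdown(panel, suspicion_score=0.6, quorum=2))  # True
print(consensus_shutdown(panel, suspicion_score=0.2, quorum=2))  # False
```

In a real system the suspicion score would come from separate LLM judgments per supervisor, not a shared scalar; the point is only the quorum structure.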
There is no good answer--I agree with you about the infinite regress--but there is a counter: the first term of the regress often offers a huge improvement over zero terms, even if perfection isn't achieved with any finite number of terms.
Who will stop the government from oppressing the people? There's no good answer to this either, but some rudimentary form of government--a single term in the regress--is much better than pure anarchy. (Of course, anarchists will disagree, but that's beside the point.)
Who's to say that my C compiler isn't designed to inject malware into every program I write, in a non-detectable way ("trusting trust")? No one, but doing a code review is far better than doing nothing.
What if the md5sum value itself is corrupted during data transfer? Possible, but we'll still catch 99.9999% of cases of data corruption using checksums.