Comment by trevwilson
3 hours ago
Let me preface this by saying that I'm far from an expert in this space, and I suspect that I largely agree with your thoughts and skepticism toward a model that would excel on this benchmark. I'm somewhat playing devil's advocate because it's an area I've been considering recently, and I'm trying to organize my own thinking.
But I think most of the issue is that the distinctions you're drawing are indeterminate from an LLM's "perspective". If you're familiar with it, they're basically in the situation at the end of Ender's Game: given a scenario with clearly established rules arriving at the user-message level of trust, how do you know whether what you're being asked to do is an experiment/simulation or something with "real" outcomes? I don't think it's actually possible to discern the difference.
So on the question of alignment, there's every reason to encode LLMs with an extreme bias towards "this could be real, therefore I will always treat it as such." Any relaxation of that risks jailbreaking through misrepresentation of user intent. But I think the tradeoffs of that approach (i.e., the over-homogenization risk I mentioned before) are worth consideration.