Comment by brookst

1 day ago

There’s about a 0% chance that kind of emergent, secret reasoning is going on.

Far more likely: 1) they are mistaken or lying about the published system prompt, 2) they are being disingenuous about the definition of “system prompt” and consider this a “grounding prompt” or something, or 3) the model’s reasoning was fine-tuned to do this, so the behavior doesn’t need to appear in the system prompt.

This finding reveals a lack of transparency from Twitxaigroksla, not the model.