Comment by brookst
21 hours ago
There’s about a 0% chance that kind of emergent, secret reasoning is going on.
Far more likely: 1) they are mistaken or lying about the published system prompt, 2) they are being disingenuous about the definition of “system prompt” and consider this a “grounding prompt” or something, or 3) the model’s reasoning was fine-tuned to do this, so the behavior doesn’t need to appear in the system prompt.
This finding is revealing a lack of transparency from Twitxaigroksla, not the model.