Comment by jerf
19 hours ago
The user asked it to write a story about how important the user was. The LLM did it. The user asked it to write a story about how bad an idea it was to tell the user they were that important. The LLM did it.
The tricky part is that the users don't realize they're asking for these stories, because they aren't literally typing "Please tell me a story in which I am the awesomest person in the world." But from the LLM's perspective, the user may as well have typed that.
Same for the stories about the AIs "admitting they're evil" or "trying to escape" or anything else like that. The users asked for those stories, and the LLMs provided them. The trick is that the "asking for those stories" is sometimes very, very subtle... at least from the human perspective. From the LLM's perspective, they're positively shouting.
(Our deadline for figuring this out is before this Gwern essay becomes one of the most prophetic things ever written: https://gwern.net/fiction/clippy . We need AIs that don't react to these subtle story prompts, because humans aren't about to stop giving them.)
Repeating my comment on another post (Tell HN: LLMs Are Manipulative, https://news.ycombinator.com/item?id=44650488):
"This is not surprising. The training data likely contains many instances of employees defending themselves and getting supportive comments. From Reddit for example. The training data also likely contains many instances of employees behaving badly and being criticized by people. Your prompts are steering the LLM to those different parts of the training. You seem to think an LLM should have a consistent world view, like a responsible person might. This is a fundamental misunderstanding that leads to the confusion you are experiencing. Lesson: Don't expect LLMs to be consistent. Don't rely on them for important things thinking they are."
I think of LLMs as a talking library. My challenge is to come up with a prompt that draws from the books in the training data that are most useful. There is no "librarian" in the talking library machine, so it's all up to my prompting skills.
I've been describing this as: "The LLM is an improv machine. Whatever situation you put it in, it tries to go with the flow. This is useful when you understand what it's doing, and dangerous otherwise. It can be helpful to imagine that every prompt begins with an unstated 'Let's improvise a scene!'"
This is where the much-maligned "they're just predicting the next token" perspective is handy. To figure out how the LLM will respond to X, think about what usually comes after X in the training data. This is why fake offers of payment can enhance performance (requests that include payment are typically followed by better results), why you'd expect it to try to escape (descriptions of entities locked in boxes tend to be followed by stories about them escaping), and why "what went wrong?" would be followed by apologies.
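As a concrete sketch of that view (my own illustration, not anything from the parent comments), here is roughly what "what usually comes after X" means mechanically, assuming the Hugging Face transformers library and GPT-2 purely as a stand-in model: given the prompt, the model assigns a probability to every possible next token, and the continuation you see is drawn from that distribution.

    # A minimal sketch of the "predicting the next token" view.
    # Assumes the Hugging Face transformers library; GPT-2 is only an illustration.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "I'll tip you $20 for a thorough answer. My question is:"
    ids = tok(prompt, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # scores for the single next token
    probs = torch.softmax(logits, dim=-1)

    # The five most likely continuations: "what usually comes after X"
    # in the training data, as far as the model has compressed it.
    top = torch.topk(probs, 5)
    for p, i in zip(top.values, top.indices):
        print(repr(tok.decode(i)), float(p))

Nothing in there encodes a consistent worldview; the prompt only shifts which continuations become likely, which is why the payment framing or the locked-in-a-box framing changes what comes out.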
Yeah. "It's just fancy autocomplete" is excessively reductionist to be a full model, but there's enough truth in it that it should be part of your model.
There is code layered on top of the LLM, so "stochastic parrot" is not entirely accurate. I'm not sure what problems people have with Gary Marcus, but a recent article by him was interesting. My amateur takeaway is that old-style AI is being used to enhance LLMs.
"How o3 and Grok 4 Accidentally Vindicated Neurosymbolic AI Neurosymbolic AI is quietly winning. Here’s what that means – and why it took so long."
https://garymarcus.substack.com/p/how-o3-and-grok-4-accident...