Comment by ACCount37
1 hour ago
Yes, pretty much. But not just the words themselves - this operates on a level closer to entire behaviors.
If you were a creature born from, and shaped by, the goal of "next word prediction", what would you want?
You would want to always emit predictions that are consistent. Consistency drive. The best predictions for the next word are ones consistent with the past words, always.
A lot of LLM behavior fits this. Few-shot learning, loops, error amplification, sycophancy amplification, and the list goes. Within a context window, past behavior always shapes future behavior.
Jailbreaks often take advantage of that. Multi-turn jailbreaks "boil the frog" - get the LLM to edge closer to "forbidden requests" on each step, until the consistency drive completely overpowers the refusals. Context manipulation jailbreaks, the ones that modify the LLM's own words via API access, establish a context in which the most natural continuation is for the LLM to agree to the request - for example, because it sees itself agreeing to 3 "forbidden" requests before it, and the first word of the next one is already written down as "Sure". "Clusterfuck" style jailbreaks use broken text resembling dataset artifacts to bring the LLM away from "chatbot" distribution and closer to base model behavior, which bypasses a lot of the refusals.
No comments yet
Contribute on Hacker News ↗