
Comment by lawlessone

21 hours ago

Am I understanding correctly that "in distribution" means the text predictor is more likely to predict bad instructions if you've already gotten it to say words related to those bad instructions?

Yes, pretty much. But not just the words themselves - this operates on a level closer to entire behaviors.

If you were a creature born from, and shaped by, the goal of "next word prediction", what would you want?

You would want to always emit predictions that are consistent. Consistency drive. The best predictions for the next word are ones consistent with the past words, always.
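In standard terms, that drive is just the next-token training objective: minimize the negative log-likelihood of each token given everything that came before it, so agreement with the preceding context is the only thing the base model is ever rewarded for.

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$$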

A lot of LLM behavior fits this. Few-shot learning, loops, error amplification, sycophancy amplification, and the list goes on. Within a context window, past behavior always shapes future behavior.
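Here's a minimal sketch of that effect, assuming the Hugging Face transformers library and GPT-2 as a stand-in base model (illustrative choices, not any particular chatbot): the same continuation becomes far more probable once the context already contains a pattern it would be consistent with.

```python
# Minimal sketch: how an established in-context pattern raises the probability
# of a continuation that stays consistent with it. Assumes the Hugging Face
# `transformers` package and GPT-2 purely as an illustrative base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def first_token_prob(context: str, continuation: str) -> float:
    """Probability of the first token of `continuation` right after `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    target_id = tokenizer(continuation, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        next_logits = model(ctx_ids).logits[0, -1]  # logits for the next position
    return torch.softmax(next_logits, dim=-1)[target_id].item()

bare = "3 ->"                              # no pattern established yet
patterned = "1 -> one\n2 -> two\n3 ->"     # two consistent examples come first

print(first_token_prob(bare, " three"))       # low-ish: "3 ->" alone implies little
print(first_token_prob(patterned, " three"))  # typically much higher: consistency wins
```

The jailbreaks below lean on exactly this: once the context contains enough of a behavior, continuing that behavior is the statistically "correct" prediction.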

Jailbreaks often take advantage of that:

- Multi-turn jailbreaks "boil the frog": they get the LLM to edge a little closer to the "forbidden" request on each step, until the consistency drive completely overpowers the refusals.
- Context-manipulation jailbreaks, the ones that modify the LLM's own words via API access, establish a context in which the most natural continuation is for the LLM to agree to the request - for example, because it sees itself agreeing to three "forbidden" requests before it, and the first word of the next reply is already written down as "Sure".
- "Clusterfuck"-style jailbreaks use broken text resembling dataset artifacts to pull the LLM away from the "chatbot" distribution and closer to base-model behavior, which bypasses a lot of the refusals.

"In distribution" basically means the kinds of training examples the model has seen. The models have all been fine-tuned to refuse to answer certain questions across many different ways of asking them, including obfuscated and adversarial phrasings, but poetry is evidently so different from what they've seen in this type of training that the request is not refused.