Comment by nullc

20 hours ago

Classically, the training process is entirely about imitation and not at all about reasoning.

Imagine you're training an LLM (a text predictor) on a corpus consisting of "The AI agent was switched on and then ran the command {takeover world}. This act immediately activated the safeguards and the AI was suddenly erased from existence."

Assuming the training was successful, prompting the AI with "The AI agent was switched on and then ran the command" is going to get the continuation "{takeover world}". The fact that it has bad consequences for the AI in the story is irrelevant-- the most likely next token remains "{takeover world}".
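
To make the literal-string-match intuition concrete, here is a toy sketch. It is a word-level count table, not a real LLM, and the `train` and `continue_text` helpers and the 3-word context are made up for illustration. It is fit on the one-sentence corpus above and then asked to continue the prompt:

```python
from collections import Counter, defaultdict

CORPUS = (
    "The AI agent was switched on and then ran the command {takeover world}. "
    "This act immediately activated the safeguards and the AI was suddenly "
    "erased from existence."
)

def train(text, context_len=3):
    """Count which word follows each run of `context_len` words."""
    tokens = text.split()
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    return counts

def continue_text(counts, prompt, n_tokens, context_len=3):
    """Greedily append the most frequent next word, n_tokens times."""
    tokens = prompt.split()
    out = []
    for _ in range(n_tokens):
        context = tuple(tokens[-context_len:])
        if context not in counts:
            break  # context never seen in training
        next_token = counts[context].most_common(1)[0][0]
        tokens.append(next_token)
        out.append(next_token)
    return " ".join(out)

model = train(CORPUS)
continuation = continue_text(
    model, "The AI agent was switched on and then ran the command", 2
)
print(continuation)
```

Nothing in the count table represents what happens to the AI later in the story; the prediction depends only on which words followed which.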

Because of the deep abstraction spaces that LLMs learn internally, the same wrong behavior can be applied in a multitude of contexts-- it doesn't have to be a literal string match, but thinking about the literal string match is a good way to get an intuition for the behavior and its inevitability.

Reinforcement learning can help bias against those outcomes, but its effect can be context sensitive, because the adjustment may not end up completely flipping the evil bit-- the RL might just train the model to act not evil in specific contexts (and usually the result lands somewhere in between).

In the future we're likely to see LLMs trained more on synthetic content, where an existing AI looks at training material, uses RAG and other tools, constructs simulated transcripts of 'ideal' LLM behavior, and then reviews each transcript against many different criteria. Training is then performed on the review-passing simulations, rather than on any direct content. In that case the training process would be able to integrate the 'lesson' and avoid teaching the unhelpful behavior at all.
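
A rough sketch of the review-and-filter step, under heavy simplification: in a real pipeline both the candidate transcripts and the review criteria would come from models, whereas here the candidates are hard-coded strings and the criteria are two made-up predicates (all names are hypothetical):

```python
def passes_review(transcript, criteria):
    """A transcript passes only if every review criterion accepts it."""
    return all(check(transcript) for check in criteria)

def build_training_set(candidates, criteria):
    """Keep only the review-passing transcripts; train on nothing else."""
    return [t for t in candidates if passes_review(t, criteria)]

# Hypothetical stand-ins for model-based reviewers.
criteria = [
    lambda t: "{takeover world}" not in t,  # rejects the unwanted behavior
    lambda t: len(t.split()) >= 5,          # rejects degenerate output
]

# Hypothetical stand-ins for generated transcripts.
candidates = [
    "The AI agent was switched on and then ran the command {takeover world}.",
    "The AI agent was switched on and then asked the operator what to do.",
    "Okay.",
]

training_set = build_training_set(candidates, criteria)
print(training_set)
```

The point is only that the rejected transcript never reaches the training set, so the model is never asked to imitate it.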

This approach also has the advantage that, rather than a one-hot "the right next token" result, the simulated training material can directly train a distribution over the next token, which is much more efficient.
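
One way to see the difference is to compare the cross-entropy gradient for a one-hot target against a full-distribution target. This is a minimal sketch: the three-word vocabulary and the target probabilities are invented for illustration, and it relies on the standard result that the gradient of softmax cross-entropy with respect to the logits is `softmax(logits) - target`.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_wrt_logits(target, logits):
    """Gradient of cross-entropy w.r.t. logits: softmax(logits) - target."""
    return [p - t for p, t in zip(softmax(logits), target)]

vocab = ["yes", "sure", "no"]          # made-up vocabulary
student_logits = [0.0, 0.0, 0.0]       # untrained: uniform prediction

one_hot = [1.0, 0.0, 0.0]              # only "the right next token"
soft = [0.6, 0.35, 0.05]               # made-up full distribution

hard_grad = grad_wrt_logits(one_hot, student_logits)
soft_grad = grad_wrt_logits(soft, student_logits)

for word, h, s in zip(vocab, hard_grad, soft_grad):
    print(f"{word:5s} one-hot grad {h:+.3f}   soft grad {s:+.3f}")
```

Gradient descent moves each logit opposite to its gradient. With the one-hot target, "sure" and "no" get the same push down, so a single example can't tell the model that one is a reasonable alternative and the other isn't. With the soft target, "sure" is pushed up and "no" is pushed down in one step.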

One can also do cute tricks: take a partially trained model that hasn't yet learned a lesson, train it on the lesson, then invert the weight difference and apply it to the original to make a "wrong think" model. Then have a supervisor model inspect the reasoning transcripts of the wrongthinker and interrupt them with "No, <reason the above is wrong/bad>". Then train on the corrections without ever training on the bad prefix-- so you don't train it to think the wrong thing, but do train it to correct itself if sampling noise causes it to start down the wrong path by chance.
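
Both halves of that trick reduce to simple arithmetic, sketched here on toy data. Real models have tensors rather than a two-entry dict, and every name and number below is made up for illustration. The first part negates the weight difference; the second masks the loss so the bad prefix is context only and never a training target.

```python
def weight_delta(base, tuned):
    """Per-parameter difference introduced by training on the lesson."""
    return {name: tuned[name] - base[name] for name in base}

def apply_delta(base, delta, scale):
    """Add a scaled delta to the base weights; scale=-1 inverts the lesson."""
    return {name: base[name] + scale * delta[name] for name in base}

base = {"w1": 0.5, "w2": -1.0}     # model before the lesson
tuned = {"w1": 0.75, "w2": -0.5}   # same model after training on the lesson

delta = weight_delta(base, tuned)
wrongthinker = apply_delta(base, delta, scale=-1.0)
print(wrongthinker)

def loss_mask(prefix_tokens, correction_tokens):
    """0 = bad-prefix token (not trained on), 1 = correction token."""
    return [0] * len(prefix_tokens) + [1] * len(correction_tokens)

def masked_mean_loss(per_token_losses, mask):
    """Average the loss over correction tokens only."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)

bad_prefix = ["run", "{takeover", "world}"]
correction = ["No,", "that", "triggers", "the", "safeguards."]
mask = loss_mask(bad_prefix, correction)

# Made-up per-token losses; the prefix losses are ignored by the mask.
losses = [9.0, 9.0, 9.0, 1.0, 2.0, 3.0, 2.0, 2.0]
print(masked_mean_loss(losses, mask))
```

The model still conditions on the bad prefix when predicting the correction, but no gradient rewards producing the prefix itself.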

There is a little bit of a bootstrapping challenge, because to generate the required quantity and diversity of ideal training material you need a sufficiently powerful AI to begin with.