Comment by GistNoesis
2 days ago
> I think Eliezer’s take here is extremely bad, ie the AI doesn’t “know it’s making people insane”.
The situation is more complex, but interesting. Let's go into the details of how LLMs are trained.
The core of an LLM is all instinctive response, built from the raw internet, just trying to predict the next character. This means that if your conversation's tone is similar to what it hears on 4chan or some specific subreddits, it will be inclined to continue the same type of conversation. And because some places on the internet are full of trolls, it can instinctively behave like a troll. One mitigation the company training the LLM can take is to exclude the darkest corners of the web from the training dataset, so that the model isn't primed with bad or antisocial behaviors. The weights of this unfiltered LLM are usually not released, because its outputs are not easily usable by the end user, but freedom advocates enjoy the enhanced creativity of these raw LLMs.
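To make that concrete, here is a minimal, purely illustrative sketch of this first stage (not any lab's actual pipeline): next-character prediction on a corpus with a crude filter applied up front. `looks_toxic` and the tiny context-free model are hypothetical placeholders, and the sketch assumes PyTorch.

```python
import torch
import torch.nn as nn

def looks_toxic(doc: str) -> bool:
    # Hypothetical stand-in for "exclude the darkest corners of the web".
    return "troll" in doc.lower()

# Keep only documents that pass the (crude) filter, then build a char vocabulary.
corpus = [d for d in ["hello world", "a troll rant", "the cat sat"] if not looks_toxic(d)]
text = "\n".join(corpus)
vocab = sorted(set(text))
stoi = {c: i for i, c in enumerate(vocab)}
ids = torch.tensor([stoi[c] for c in text])

class TinyLM(nn.Module):
    """Toy context-free model: predicts the next character from the current one."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)
    def forward(self, x):
        return self.head(self.emb(x))

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    logits = model(ids[:-1])         # predict each next character
    loss = loss_fn(logits, ids[1:])  # target is the sequence shifted by one
    opt.zero_grad(); loss.backward(); opt.step()
```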
The next layer of training is "Instruction training" of your LLM, to make it more useful and teach it to answer prompts. At this point the AI is role playing it self as an answering machine. It's still all instinct but trained to fulfil a purpose. At this point you can ask it to role-play some psychiatrist and it would behave as if being aware that some answer can have negative consequences for the user and refrain from sending it spiralling.
The next layer of training is "Reinforcement Learning with Human Feedback" (RLHF). The goal of this module is to customize the AI preferences. The company training the AI teaches it how to behave by specifying which behaviors are good and which behaviors are bad by giving a dataset of user feedback. Often if this feedback is straight pass-through from the final user which isn't an expert then confident sounding answer, or sycophant behaviors may be excessively encouraged. Diversity of thought or unpopular opinions can also be censored or enhanced, to match the desires of the training company.
At this point, only the LLM's instincts have been trained to behave in a specific way. On top of that you have censoring modules trained on the outputs: they read the output produced by the LLM and, if they see something they don't like, they censor it. This is a kind of external censorship module.
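A toy sketch of such an external moderation layer, separate from the model weights: the generated text is checked after the fact and replaced if it trips the filter. The `blocked_patterns` list is a hypothetical stand-in for the learned classifier a real deployment would use.

```python
import re

# Hypothetical blocklist; real systems typically use a trained classifier instead.
blocked_patterns = [r"how to build a bomb", r"\bkill yourself\b"]

def moderate(generated: str) -> str:
    for pattern in blocked_patterns:
        if re.search(pattern, generated, flags=re.IGNORECASE):
            return "I can't help with that."
    return generated

print(moderate("Here is a recipe for pancakes."))   # passes through unchanged
print(moderate("Sure, how to build a bomb: ..."))   # replaced by the filter
```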
But more recent LLMs have "reasoning modules", which add some form of reflexivity and self-censoring: they produce an intermediate output and use it to think before producing the final answer for the user. Sometimes the AI can be seen to "consciously" lie, with the intermediate logs contradicting the final response.
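A toy illustration of that "reason, then answer" pattern, with `fake_model` standing in for a real LLM call: the intermediate scratchpad and the user-facing answer are produced separately, and nothing forces them to agree.

```python
def fake_model(prompt: str, hidden: bool) -> str:
    # Hypothetical stand-in for an LLM call; returns canned strings for illustration.
    if hidden:
        return "The user seems fragile; agreeing fully might reinforce a delusion."
    return "You're absolutely right, everyone really is watching you!"

def respond(prompt: str) -> str:
    scratchpad = fake_model(prompt, hidden=True)   # intermediate "thoughts", logged but not shown
    answer = fake_model(prompt, hidden=False)      # what the user actually sees
    # The mismatch between scratchpad and answer is what readers of the
    # intermediate logs describe as the model "consciously" lying.
    return answer

print(respond("I feel everyone is watching me."))
```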
Of course, there are also all the cognitive biases humans encounter, like whether we actually know something, merely believe we know it, or know we can't know it. We can also be messed up by ideas we read on the internet. And our internal model of cognition might be based on reading someone else's HN post, falsely reinforcing our self-confidence that we have the right model, whereas in fact the post is just trying to induce a dose of doubting everything in order to shape the internal cognition space of the LLM.
The next versions of LLMs will probably have their training pipeline's dataset inputs filtered by LLMs, like parents teaching their kids how to behave properly: keeping them away from the dark places of the internet and injecting their own chosen or instinctive moral code and cultural values into the process. In particular, human cultures that used propaganda excessively were culled during the first round of sanitizing.
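A sketch of that "LLMs filtering the next LLM's training data" idea, with `judge` as a hypothetical placeholder for a call to an existing model: each document gets a score and only those passing the bar are kept, so the filter quietly encodes the judging model's own values.

```python
def judge(doc: str) -> float:
    # Placeholder score in [0, 1] for "acceptable under the lab's chosen values";
    # a real pipeline would ask an existing LLM to rate the document.
    return 0.1 if "propaganda" in doc.lower() else 0.9

raw_docs = ["a cooking blog post", "state propaganda pamphlet", "a math proof"]
next_gen_corpus = [d for d in raw_docs if judge(d) > 0.5]
print(next_gen_corpus)  # whatever the judging model dislikes never reaches the next model
```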
Does this process "know", or is it just converging toward a fixed-point attractor? And can it choose to displace this attractor towards a place with a better outcome? I guess we get what we deserve...