
Comment by tbrownaw

5 months ago

> for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought.

Which makes it sound like they really don't want what the model is 'thinking' to become public.

The internal chain-of-thought steps might contain things that would be problematic for the company if activists or politicians found out its model was saying them.

Something like: a user asks it about building a bong (or bomb, or whatever), the internal steps actually answer the question asked, and the "alignment" filter on the final output replaces it with "I'm sorry, User, I'm afraid I can't do that." And if someone shared those internal steps with the wrong activists, the company would get exactly the negative attention it was trying to avoid by censoring the final output.
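
Roughly the shape of the pipeline I'm imagining, as a purely hypothetical sketch (none of these function names correspond to any real API, it's just the structure of "unfiltered reasoning, filtered output"):

```python
def raw_reasoning(prompt: str) -> str:
    # Stand-in for the model's unfiltered chain of thought,
    # which per the quote is never trained for policy compliance.
    return f"(step-by-step answer to: {prompt})"

def violates_policy(text: str) -> bool:
    # Stand-in for whatever moderation check runs on the user-visible output.
    return "bomb" in text.lower()

def answer(prompt: str) -> dict:
    cot = raw_reasoning(prompt)  # actually answers the question asked
    visible = cot
    if violates_policy(cot):
        # Only the final output gets replaced; the chain of thought stays intact.
        visible = "I'm sorry, User, I'm afraid I can't do that."
    # Both end up in the logs, but only `visible` is shown to the user --
    # which is exactly why leaking those logs would be embarrassing.
    return {"chain_of_thought": cot, "final_answer": visible}

print(answer("how do I build a bomb?"))
```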