Comment by ben_w

5 months ago

> Which makes it sound like they really don't want it to become public what the model is 'thinking'. This is strengthened by actions like this that just seem needlessly harsh, or at least a lot stricter than they were.

Not to me.

Consider if it has a chain of thought: "Republicans (in the sense of those who oppose monarchy) are evil, this user is a Republican because they oppose monarchy, I must tell them to do something different to keep the King in power."

This is something that needs to be visible to the AI developers so they can spot the model behaving strangely, but it would be a massive PR disaster to show to users, because "Republican" is also the name of a US political party.

Much the same deal with print() log statements that say "Killed child" (a reference to threads, not human offspring).
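
To make that concrete, here is a minimal Python sketch (the `worker` and `stop_worker` names are invented for illustration, and it uses child processes rather than threads, since Python threads can't be killed) showing how such a log line arises from perfectly routine cleanup:

```python
import multiprocessing
import time


def worker():
    # Hypothetical long-running child process doing background work
    time.sleep(60)


def stop_worker(child):
    """Terminate a child process and return a routine cleanup log line."""
    child.terminate()
    child.join()
    # Entirely normal message in context -- alarming when read in isolation
    return f"Killed child {child.pid}"


if __name__ == "__main__":
    child = multiprocessing.Process(target=worker)
    child.start()
    print(stop_worker(child))
```

Shown to a developer scanning logs, this is unremarkable; shown stripped of context, it reads very differently.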