Comment by albert_e

13 hours ago

If a tiny misconfiguration of reward system can cause such noticeable annoyance ...

What dangers lurk beneath the surface.

This is not funny.

This is the real nugget of wisdom here. This should be confirmation to everyone that no one understands the LLM internals and they are not aligned. When they are eventually given control to run things, they will behave in wildly unexpected ways, and past the point of being able to change them.

This is a worry that people have been talking about in various forms for a while now, and I think it's a gigantic one. The only reason this was caught is that the quirk was a very noticeable verbal one. When words like "goblin" and "gremlin" pop up it is easy for us to spot. If the quirk takes another shape (say, ranking certain people with certain features as less trustworthy) it might be too subtle or too weird for us to notice it. Would I ever notice if ChatGPT consistently rates people born in June to be untrustworthy?

Here is an academic paper discussing this kind of worry: https://link.springer.com/article/10.1007/s11023-022-09605-x