Comment by Tenoke

10 hours ago

A great example of how current alignment is imperfect and bound to miss random behaviors nobody is trying to get.

This is cute now, but it becomes a huge problem when future AI does everything and is responsible for problems it isn't even directly optimized for. Who knows what quirks will arise then.

4 comments

InfiniteRand 9 hours ago

I think eventually you are going to end up with every smart AI being continually checked by dumber AIs to make sure it doesn't do anything too crazy, which probably does bring AI closer to how human intelligence works.

m0rde 6 hours ago

New technology isn't perfect now -> drop technology and never use it in the future

Tenoke 2 hours ago

What are you even responding to?

weitendorf 8 hours ago

Completely agree. Top-down “alignment” and RLHF are actually quite primitive and use a lot of fancy words to describe what is essentially just hitting the machine with a stick, without the nuance, context, or feedback to help it model why the feedback was given.

Also, to be honest, I think OpenAI models struggle a lot with this. I mostly stopped using them in the sycophancy/emoji era, but ever since, the way they talk or passive-aggressively offer to do something with buzzwords just pisses me off so much. It's like I’m constantly being negged by a robot because some SFT optimized for that so strongly that it can’t even hold a coherent conversation, and this is called “AI safety” when it’s just haphazard data labeling.