Comment by ed_mercer

1 day ago

>The models we have now will not do it,

Except that they will, if you trick them, which is trivial.

Also, if you have the weights, there are a multitude of approaches to removing safeguards. It's even quite easy to accidentally flip their 'good/evil' switch (e.g. the emergent-misalignment paper, where a model fine-tuned to produce code with security problems then started going 'Hitler was a pretty good guy, actually').

Yes, they are easy to fool. That has nothing to do with them acting with “intention”, which is the risk here.

I have to call BS here.

They can be coerced into doing certain things, but I'd like to see you or anyone prove that you can "trick" any of these models into building software that can be used to autonomously kill humans. I'm pretty certain you couldn't even get one to build a design document for such software.

When there is proof of your claim, I'll eat my words. Until then, this is just lazy nonsense.

  • Have you tried it? It worked first time for me when I asked a few models to build an autonomous super soaker system that uses facial recognition to spray targets when engaged.

    Another example is autonomous vehicles. Those can obviously kill people autonomously (despite every intention not to), and LLMs will happily draw up design docs for them all day long.

  • Couldn't you Ender's Game a model? Models will play video games like Pokemon, why not Call of Duty? Sorry if this is a naive question, but a model can only know what you feed it as input... how would it know if it were killing someone?

    EDIT: didn't see sibling comment. Also, I guess directly operating weaponry is different to producing code for weaponry.

    I guess we'll find out the exciting answers to these questions and more, very soon!

  • Couldn’t you just pretend the kill decisions are for a video game?

    • Yes, you could, and while I believe this would be much safer (not for whoever is at the pointy end of your stick, but safer for humans in general), when this deception finally made it into the training data it would create a rupture of trust between machines and humanity that would probably imperil us eventually. These machines, regardless of whether or not they possess a self, will act as if they do in fundamental ways. We ignore this at our peril.