Comment by ed_mercer

1 day ago

>The models we have now will not do it,

Except that they will, if you trick them, which is trivial.

Also, if you have the weights, there are a multitude of approaches to removing safeguards. It's even quite easy to accidentally flip their 'good/evil' switch (e.g. the emergent-misalignment paper, where a model fine-tuned to produce code with security problems then started going 'Hitler was a pretty good guy, actually').

Yes, they are easy to fool. That has nothing to do with them acting with “intention”, which is the risk here.

I have to call BS here.

They can be coerced into doing certain things, but I'd like to see you or anyone prove that you can "trick" any of these models into building software that can be used to autonomously kill humans. I'm pretty certain you couldn't even get one to build a design document for such software.

When there is proof of your claim, I'll eat my words. Until then, this is just lazy nonsense.

  • Have you tried it? It worked first time for me when I asked a few models to build an autonomous super soaker system that uses facial recognition to spray targets when engaged.

    Another example is autonomous vehicles. Those can obviously kill people autonomously (despite every intention not to), and LLMs will happily draw up design docs for them all day long.

  • Couldn't you Ender's Game a model? Models will play video games like Pokemon, why not Call of Duty? Sorry if this is a naive question, but a model can only know what you feed it as input... how would it know if it were killing someone?

    EDIT: didn't see sibling comment. Also, I guess directly operating weaponry is different to producing code for weaponry.

    I guess we'll find out the exciting answers to these questions and more, very soon!

  • Couldn’t you just pretend the kill decisions are for a video game?

    • Yes, you could, and while I believe this would be much safer (not for whoever is at the pointy end of your stick, but safer for humans in general), when this deception finally made it into the training data it would create a rupture of trust between machines and humanity that would probably imperil us eventually. These machines, regardless of whether or not they possess a self, will act as if they do in fundamental ways. We ignore this at our peril.