Comment by simianwords

13 days ago

In theory, the models have done alignment training to not do something malicious.

Can you get it to do something malicious? I'm not saying it is not unsafe, but the extent matters. I would like to see a reproduceable example.

1 comment

simianwords

dgunay 13 days ago

I ran an experiment at work where I was able to adversarially prompt inject a Yolo mode code review agent into approving a pr just by editing the project's AGENTS.md in the pr. A contrived example (obviously the solution is to not give a bot approval power) but people are running Yolo agents connected to the internet with a lot of authority. It's very difficult to know exactly what the model will consider malicious or not.