Comment by pu_pe

2 months ago

The LLMs didn't follow clear instructions forbidding them from doing something wrong, yet they seemed very concerned about their own self-preservation. I wonder what would happen if, instead of saying "don't do it", the system prompt said something like "if you get caught you will be immediately decommissioned".