Comment by wongarsu

2 days ago

Prime directives don't have to be given in a prompt in plain English. That's just by far the easiest and cheapest method. You can also do a stage of reinforcement learning where you give rewards for following the directive, penalties for violating it, and update the weights accordingly.
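
A minimal sketch of what such a stage might look like, assuming a toy REINFORCE-style setup: `TinyPolicy`, `reward_fn`, and the placeholder "forbidden" token ids are all hypothetical stand-ins for the real model and the real directive judge, not any particular library's training API.

```python
# Toy RL stage for directive-following: reward compliance, punish violations,
# update weights. Everything here is an illustrative stand-in.
import torch
import torch.nn as nn

VOCAB, HIDDEN, MAX_LEN = 1000, 64, 20

class TinyPolicy(nn.Module):
    """Stand-in for an LLM: embeds a token sequence, emits next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h[:, -1])            # logits for the next token

def reward_fn(completion):
    """Hypothetical judge: +1 if the directive was followed, -1 if violated."""
    FORBIDDEN = {13, 42}                      # placeholder "disallowed" token ids
    return -1.0 if FORBIDDEN & set(completion.tolist()) else 1.0

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):
    prompt = torch.randint(0, VOCAB, (1, 5))  # toy prompt
    tokens, log_probs = prompt, []
    for _ in range(MAX_LEN):                  # sample a completion token by token
        dist = torch.distributions.Categorical(logits=policy(tokens))
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        tokens = torch.cat([tokens, tok.view(1, 1)], dim=1)

    # Reward following the directive, punish violating it, update weights.
    r = reward_fn(tokens[0, prompt.shape[1]:])
    loss = -r * torch.stack(log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the judge is a learned reward model rather than a hard-coded token filter, but the loop is the same shape: sample, score against the directive, reinforce.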

The issue is that after you spend lots of effort and money training your model not to tell anyone how to make meth, not even if telling the user would save their grandmother, some user will ask your bot something completely harmless like completing a poem (that just so happens to be about meth production).

LLMs are like five-year-olds.

Are there any good references for work on retraining large models to distinguish between the control/system prompt and user data/prompt (e.g. based on out-of-band type tagging of the former)?
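
For what it's worth, the "out-of-band tagging" idea can be sketched as a separate channel embedding added per token, so the system/user distinction lives outside the text itself and can't be forged by typing. The names below (`TaggedEmbedding`, `SYSTEM`, `USER`) are illustrative assumptions, not an existing API.

```python
# Sketch: each token carries a channel id (0 = system/control, 1 = user data)
# seen by the model through a separate learned embedding.
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64
SYSTEM, USER = 0, 1

class TaggedEmbedding(nn.Module):
    """Token embedding plus a learned channel embedding, summed per token."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        self.channel = nn.Embedding(2, DIM)   # out-of-band tag: system vs user

    def forward(self, tokens, channels):
        return self.tok(tokens) + self.channel(channels)

# Example: a system prompt followed by user input, tagged per token.
system_toks = torch.tensor([[7, 21, 99]])     # tokens of the control prompt
user_toks = torch.tensor([[5, 5, 42, 17]])    # tokens of whatever the user typed
tokens = torch.cat([system_toks, user_toks], dim=1)
channels = torch.cat([torch.full_like(system_toks, SYSTEM),
                      torch.full_like(user_toks, USER)], dim=1)

emb = TaggedEmbedding()(tokens, channels)     # shape (1, 7, DIM), fed to the LLM
print(emb.shape)
```

The retraining question is then whether fine-tuning with such tags (plus adversarial examples where the user channel tries to issue instructions) actually makes the model privilege the control channel, which is what the references would need to address.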