Comment by jampekka
3 months ago
In short, yes. That's how raw base models trained to imitate internet text are turned into chatbots in general. Making a model refuse to talk about certain things is technically no different.
There are multiple ways to do this: humans rating answers (e.g. Reinforcement Learning from Human Feedback, Direct Preference Optimization), humans writing example answers (Supervised Fine-Tuning), and other prespecified models ranking answers, giving examples, and/or supplying extra context (e.g. Anthropic's "Constitutional AI").
For the leading models it's probably a mix of all of those, but this fine-tuning step is usually not well documented.
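To make the preference-rating idea concrete, here is a minimal sketch of the DPO objective for a single preference pair. The function name and inputs are illustrative; real training computes these log-probabilities with the model being tuned and a frozen copy of the base model, and averages the loss over a dataset of human-rated answer pairs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Illustrative DPO loss for one (chosen, rejected) answer pair.

    logp_* are log-probabilities of each answer under the model being
    tuned; ref_logp_* are the same under a frozen reference (base)
    model. beta controls how strongly preferences pull the tuned model
    away from the reference.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)): small when the tuned model favors
    # the chosen answer more than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Tuned model prefers the chosen answer more than the reference does:
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Minimizing this pushes the model toward human-preferred answers (including refusals, if raters prefer them) without straying too far from the base model.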