Comment by unshavedyak
2 years ago
I'm not one to mind the guardrails, but what I hate is something you mentioned: fighting the tool.
E.g., "Do an X-like thing," where X is something it may not be allowed to do, gets rejected. But then I say, "Well, of course, that's why I said X-like. Do what you can in that direction, so that it is still okay."
Why do I even have to say that? I get why, but still, I'm just expressing my frustration. I'm not trying to push boundaries, and I'm usually happy to ignore the off-limits stuff. But when it collides so easily with stuff that is actually okay but merely near the off-limits stuff, a whole bunch of other, actually okay, stuff becomes randomly off limits as well.
This reminds me of everyday interactions on Stack Overflow. "Yes, I really really really do want to use the library and language I mentioned."
This is a great point, and something that may be at least partially addressable with current methods (e.g., RLHF/SFT). Perhaps part of what's missing is a tighter feedback loop between (a) the limitations experienced by the human users of models (e.g., "actually okay but just near the off-limits stuff") and (b) the model's training signal.
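To make that feedback loop concrete, here is a minimal sketch of one way it could be wired up. Everything here is hypothetical (the `Interaction` record, the `over_refusal` flag, the lookup of an acceptable alternative response); the idea is just that user reports of unwarranted refusals could be turned into preference pairs of the kind used for reward-model/RLHF-style training.

```python
# Hypothetical sketch: convert user feedback about over-cautious refusals
# into (chosen, rejected) preference pairs for preference-based training.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Interaction:
    prompt: str
    response: str
    refused: bool                 # did the model refuse the request?
    user_flag: Optional[str]      # e.g. "over_refusal" if the user says the refusal was unwarranted

def to_preference_pairs(logs, compliant_lookup):
    """For each interaction flagged as an over-refusal, pair the refusal
    (rejected) with an acceptable compliant response (chosen), if available."""
    pairs = []
    for it in logs:
        if it.refused and it.user_flag == "over_refusal":
            chosen = compliant_lookup.get(it.prompt)  # e.g. human-written or regenerated answer
            if chosen is not None:
                pairs.append({"prompt": it.prompt,
                              "chosen": chosen,
                              "rejected": it.response})
    return pairs

logs = [
    Interaction("Do an X-like thing", "I can't help with that.", True, "over_refusal"),
    Interaction("Do something disallowed", "I can't help with that.", True, None),
]
pairs = to_preference_pairs(
    logs, {"Do an X-like thing": "Here's the closest allowed version: ..."})
```

Only the flagged over-refusal becomes a training pair; the legitimately refused request is left alone, which is exactly the distinction the parent comment is asking the guardrails to learn.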
Thank you for the insightful perspective!