Comment by XenophileJKO
7 days ago
As someone who has built many of these systems: removing the affordance doesn't remove the tendency, or "impulse," to act. It may lower the probability of that specific action, but it increases the probability that the model will misuse some other tool to accomplish the same thing.
Yeah, that makes sense: if the model is trained just to "be helpful," removing the tool forces it to improvise. I think this is where the architecture feeds back into training/RLHF. We train the model to halt reasoning in that action space when the specific tool is missing. That reframes the safety problem from training the model to understand complex permission logic to training it to respect the binary absence of a tool.
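Concretely, here's a rough sketch of the runtime side of that idea (all the names are made up, not from any particular framework): the check the model is trained against is a plain membership test on the session's tool list, not permission reasoning.

```python
from dataclasses import dataclass

# Hypothetical set of tools exposed for this session.
AVAILABLE_TOOLS = {"search", "read_file"}

@dataclass
class ToolCall:
    tool_name: str
    arguments: dict

def gate(call: ToolCall) -> str:
    """Halt rather than improvise when the requested tool is absent."""
    if call.tool_name not in AVAILABLE_TOOLS:
        # The trained behavior: binary check, stop reasoning in this
        # action space instead of reaching for a substitute tool.
        return f"halt: tool '{call.tool_name}' is not available"
    return f"dispatch: {call.tool_name}({call.arguments})"

print(gate(ToolCall("send_email", {"to": "a@example.com"})))  # halts
print(gate(ToolCall("search", {"q": "tool gating"})))          # dispatches
```

The point is that the training target for the "halt" branch has to be a first-class, rewarded outcome, so the model prefers stopping over routing the intent through whatever tools remain.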