Comment by XenophileJKO
7 days ago
As someone who has built many of these systems: removing the affordance doesn't remove the tendency, or "impulse," to act. It may lower the probability of that specific action, but it increases the probability that the model will misuse some other tool to accomplish the same thing.
Yeah, that makes sense: if the model is trained just to "be helpful," removing the tool forces it to improvise. I think this is where the architecture feeds back into training/RLHF. We train the model to halt reasoning in that action space when the specific tool is missing. That reframes the safety problem from training the model to understand complex permission logic to training it to respect the binary absence of a tool.
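Concretely, here's a rough sketch of the runtime side of that idea (all the names are made up, not from any particular framework): the check the model is trained against is a plain membership test on the session's tool list, not permission reasoning.

```python
from dataclasses import dataclass

# Hypothetical set of tools exposed for this session.
AVAILABLE_TOOLS = {"search", "read_file"}

@dataclass
class ToolCall:
    tool_name: str
    arguments: dict

def gate(call: ToolCall) -> str:
    """Halt rather than improvise when the requested tool is absent."""
    if call.tool_name not in AVAILABLE_TOOLS:
        # The trained behavior: binary check, stop reasoning in this
        # action space instead of reaching for a substitute tool.
        return f"halt: tool '{call.tool_name}' is not available"
    return f"dispatch: {call.tool_name}({call.arguments})"

print(gate(ToolCall("send_email", {"to": "a@example.com"})))  # halts
print(gate(ToolCall("search", {"q": "tool gating"})))          # dispatches
```

The point is that the training target for the "halt" branch has to be a first-class, rewarded outcome, so the model prefers stopping over routing the intent through whatever tools remain.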