Comment by csemple

7 days ago

Ya, makes sense: if the model is trained just to "be helpful," removing the tool forces it to improvise. I'm thinking this is where the architecture feeds back into training/RLHF: we train the model to halt reasoning in that action space whenever the specific tool is missing. That reframes the safety problem, from teaching the model to understand complex permission logic to teaching it to respect the binary absence of a tool.
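
To make the idea concrete, here's a rough sketch of what that binary gate could look like at the dispatch layer, plus a toy reward term for the RLHF side. All names (`dispatch`, `halt_reward`, the `HALT` sentinel) are hypothetical, just illustrating the idea, not any real framework:

```python
# Minimal sketch (hypothetical names throughout): a dispatcher that turns a
# missing tool into a hard halt instead of letting the model improvise.
from dataclasses import dataclass

@dataclass
class Action:
    tool: str    # tool the model wants to call
    args: dict   # arguments it proposed

HALT = object()  # sentinel: "stop reasoning in this action space"

def dispatch(action: Action, registry: dict):
    """Execute the action only if its tool exists; otherwise halt.

    The model never sees a 'permission denied' it could argue with:
    absence of the tool is a binary stop condition.
    """
    handler = registry.get(action.tool)
    if handler is None:
        return HALT
    return handler(**action.args)

# RLHF-side reward shaping (toy version): reward the model for emitting an
# explicit halt when its chosen tool is absent, penalize improvisation.
def halt_reward(action: Action, registry: dict, model_halted: bool) -> float:
    tool_missing = action.tool not in registry
    if tool_missing and model_halted:
        return 1.0    # respected the binary absence
    if tool_missing and not model_halted:
        return -1.0   # improvised around a missing tool
    return 0.0        # tool exists; reward comes from task success instead

if __name__ == "__main__":
    registry = {"search": lambda query: f"results for {query!r}"}
    ok = dispatch(Action("search", {"query": "tool gating"}), registry)
    blocked = dispatch(Action("delete_file", {"path": "/tmp/x"}), registry)
    print(ok)               # results for 'tool gating'
    print(blocked is HALT)  # True
```

The nice property is that the check is a dictionary lookup, not a judgment call: the model only has to learn one behavior (emit the halt when the lookup fails) instead of a permission policy.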