Comment by Animats
14 hours ago
It's interesting that people are writing tools that go inside the weights and do things. We're getting past the black box era of LLMs.
That may or may not be a good thing.
Whether or not the linked tool uses a good approach, manipulating models the way you mention is already fairly well established; see https://huggingface.co/blog/mlabonne/abliteration for one example.
I believe this has already been done to several models. One set I've come across is the JOSIEfied models from Gökdeniz Gülmez. I downloaded one or two and tried them on a local Ollama setup. They do generate potentially dangerous output. Turning on thinking for the Qwen series shows how the model arrives at its conclusions, and it's quite disturbing.
However, after a few rounds of conversation, they get stuck in loops and just repeat the same things over and over. The main JOSIE models worked the best of all and were still useful even after abliteration.
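For anyone wondering what abliteration actually does to the weights, here is a rough toy sketch of the core idea as I understand it from the linked post: estimate a "refusal direction" as the normalized difference between mean activations on harmful and harmless prompts, then project that direction out of a weight matrix that writes into the residual stream. The function names and the tiny hand-made numbers are my own, purely for illustration. A real run collects activations from an actual model and applies this across many layers.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation)."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate_direction(weight, direction):
    """Return W - r (r^T W), so that W @ x has no component along r.

    `weight` is a d_model x d_in matrix (list of rows) whose output is
    added to the residual stream; `direction` is a unit vector of
    length d_model.
    """
    d_model, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction
    coeffs = [sum(direction[i] * weight[i][j] for i in range(d_model))
              for j in range(d_in)]
    return [[weight[i][j] - direction[i] * coeffs[j] for j in range(d_in)]
            for i in range(d_model)]

# Toy activations (hypothetical): the two prompt sets differ only on axis 0.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

# Toy 3x2 weight matrix (hypothetical) writing into a 3-dim residual stream.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = ablate_direction(W, r)
```

In this toy case the direction comes out along the first axis, so the first row of `W_ablated` is zeroed and the other rows are untouched. A blunt projection like this, applied everywhere, could plausibly be part of why some abliterated models degrade after a few turns, though that is speculation on my part.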