Comment by Animats

14 hours ago

Link?

It's interesting that people are writing tools that go inside the weights and do things. We're getting past the black box era of LLMs.

That may or may not be a good thing.

thegrim33 14 hours ago

Whether or not the linked tool uses a good approach, manipulating models the way you mention is already fairly well established; see https://huggingface.co/blog/mlabonne/abliteration for an example.

noufalibrahim 14 hours ago

I believe that this has already been done to several models. One set that I've come across is the JOSIEfied models from Gökdeniz Gülmez. I downloaded one or two and tried them on a local Ollama setup. They do generate potentially dangerous output. Turning on thinking for the Qwen series shows how a model arrives at its conclusions, and it's quite disturbing.

However, after a few rounds of conversation, they get into loops and just repeat things over and over again. The main JOSIE models worked the best of all and were still useful even after abliteration.