Comment by perryizgr8
10 hours ago
These guys are at the absolute frontier, why can't they rigorously find the exact weights that are causing this problem? That's how software "engineering" should work. Not trying combinations of English words and hoping something works. This is like a brain surgeon talking to his patient hoping he can shock his brain in the right way that fries the tumor inside. Get in there and surgically remove the unwanted matter!
LLM’s aren’t software (except in an uninteresting obvious sense); they are “grown, not made” as the saying is. And sure, they can find which weights activate when goblins come up (that’s basic mechanistic interpretability stuff), but it’s not as simple as just going in and deleting parts of the network. This thing is irreducibly complex in an organic delocalized way and information is highly compressed within it; the same part of the network serves many different purposes at once. Going in and deleting it you will probably end up with other weird behaviors.
Imagine someone deleting goblin neurons. In your brain.
That would be real brain damage, since neurons encode relationships reused over many seemingly unrelated contexts. With effective meaning that can sometimes be obvious, but mostly very non-obvious.
In matrix based AI, the result is the same. There are no "just goblin" weights.