Comment by robertk
9 hours ago
You don't know what you are talking about. Obviously refusal circuitry does not live in one layer, but the repo is built on a paper with sound foundations from an Anthropic scholar working with a DeepMind interpretability mentor: https://scholar.google.com/citations?view_op=view_citation&h...