Comment by JoshuaDavid

2 years ago

Anthropic has published some cool stuff in that direction: https://transformer-circuits.pub/2023/monosemantic-features

JoshuaDavid

Whoa, this is super cool! If we had something like this for ChatGPT, we could use it to do some serious prompt engineering. Imagine seeing which specific neurons your prompt was activating, and being able to identify which word in the prompt was triggering an undesired behavior. Super cool stuff; I'm excited to see if it scales.