triyambakam 2 years ago
Is a linear probe part of observability/interpretability?

canjobear 2 years ago
Yes, a pretty fundamental technique and one of the earliest. It lets you determine which layers contain what information among other things.

Legend2440 2 years ago
The downside is that it's a supervised technique, so you need to already know what you're looking for. It would be nice to have an unsupervised tool that could list out all the things the network has learned.

JoshuaDavid 2 years ago
Anthropic has published some cool stuff in that direction: https://transformer-circuits.pub/2023/monosemantic-features
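For anyone who hasn't seen one, here is a rough sketch of what "probing each layer" can look like. Everything in it is an assumption for illustration: the activations are synthetic stand-ins (an "early" layer with no label information and a "late" layer where the label is linearly encoded), the function names are made up, and the probe is a plain least-squares classifier rather than the logistic regression people more commonly use. Note that it needs labels up front, which is the supervised limitation mentioned above.

```python
import numpy as np

def fit_linear_probe(acts, labels):
    """Fit a least-squares linear classifier on activations; labels are 0/1."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
    y = 2.0 * labels - 1.0                               # map {0, 1} to {-1, +1}
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def probe_accuracy(w, acts, labels):
    """Fraction of examples whose label the probe predicts correctly."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    preds = (X @ w > 0).astype(int)
    return float((preds == labels).mean())

def make_synthetic_layers(n=2000, dim=32, seed=0):
    """Hypothetical stand-in for real activations (assumption, not a real model)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    early = rng.normal(size=(n, dim))  # pure noise: carries no label information
    # Noise plus a shift along one direction whose sign depends on the label.
    late = rng.normal(size=(n, dim)) + 3.0 * np.outer(2 * labels - 1, direction)
    return {"early": early, "late": late}, labels

def probe_all_layers(layers, labels, train_frac=0.5):
    """Train one probe per layer and score it on held-out examples."""
    split = int(len(labels) * train_frac)
    results = {}
    for name, acts in layers.items():
        w = fit_linear_probe(acts[:split], labels[:split])
        results[name] = probe_accuracy(w, acts[split:], labels[split:])
    return results

layers, labels = make_synthetic_layers()
scores = probe_all_layers(layers, labels)
for name, acc in scores.items():
    print(f"{name}: held-out probe accuracy = {acc:.2f}")
```

With data built this way, the "early" probe should land near chance and the "late" probe well above it, which is the sense in which a probe tells you which layers contain what information. On a real network you would replace `make_synthetic_layers` with activations captured from the model.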