← Back to context

Comment by kovek

8 hours ago

Maybe you can't 100% know what every layer "thinks", if you go through all the layers, you might see a cohesive "thinking" story. So, if there is any information you lose at layer N, you might learn some of it in layer N+1. The masking in the layers is not deterministic so the model can't really consistently lie throughout the layers. It doesn't chose what information we get to inspect. There might be a game of whack-a-mole, but you might get a general sentiment. I think the more layers there are, the more the model itself can hide very nuanced lies (But by that time we'd have a better mind-reading model).

However, I haven't read about it yet. I'm really excited to look into it!