Comment by NiloCK
5 hours ago
It's notable here that the training run didn't have access to the 'plaintext' context the LLM was working in.
It'd be quite a coincidence if the training run discovered an invertible weights → text → weights function that produces text that both "is on topic and intelligible as an inner monologue in context" and is also unrelated to the meaning encoded in the activations.