Comment by varun764
13 hours ago
Working on AI interpretability infrastructure! Starting with hallucination/failure mode detection and causal tracing in transformer-based foundation models. It's non-SAE, domain-agnostic, and works for any kind of tokens or sequence data (e.g. not just English text - biological sequences, physics data, etc). Some of our early tests indicate that this detection setup could broadly work well for any property, but we've mostly validated on hallucinations. If you have access to self-trained or open-weight models, would love to have you try out the alpha version - running it through HuggingFace!
No comments yet
Contribute on Hacker News ↗