Comment by varun764

13 hours ago

Working on AI interpretability infrastructure! Starting with hallucination/failure mode detection and causal tracing in transformer-based foundation models. It's non-SAE, domain-agnostic, and works for any kind of tokens or sequence data (e.g. not just English text - biological sequences, physics data, etc). Some of our early tests indicate that this detection setup could broadly work well for any property, but we've mostly validated on hallucinations. If you have access to self-trained or open-weight models, would love to have you try out the alpha version - running it through HuggingFace!