Comment by liteclient
12 hours ago
it makes sense architecturally
they replace dot-product attention with scalar distances derived from a graph-laplacian embedding of the tokens, which effectively reduces attention scoring to a 1D energy comparison: one scalar per token instead of d-dimensional dot products, so it can save memory and compute (roughly like the sketch below)
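to make that concrete, here's a toy numpy sketch of how i read the idea -- this is my reconstruction, not the paper's code, and the names (`laplacian_scalar_attention`, `adj`) are made up:

```python
# standard attention scores are q @ k.T (d-dim dot products per pair);
# here each token instead gets ONE scalar coordinate from a graph-laplacian
# eigenvector, and scores are just negative absolute differences of those
# scalars -- the "1d energy comparison"
import numpy as np

def laplacian_scalar_attention(x, adj):
    """x: (n, d) token features, adj: (n, n) token-graph adjacency."""
    # unnormalized graph laplacian L = D - A
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    # 1d spectral embedding: eigenvector of the smallest nonzero eigenvalue
    eigvals, eigvecs = np.linalg.eigh(lap)
    energy = eigvecs[:, 1]                       # (n,) one scalar per token
    # score(i, j) = -|e_i - e_j|: O(n^2) scalar subtractions,
    # no q/k projections, no d-dim dot products
    scores = -np.abs(energy[:, None] - energy[None, :])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                           # (n, d) mixed values

# tiny usage example: 4 tokens connected as a path graph
x = np.random.randn(4, 8)
adj = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
out = laplacian_scalar_attention(x, adj)
```

the win, if it holds up, is that the per-pair score is a 1d comparison instead of a d-dim inner product; the cost is that everything hinges on the laplacian embedding actually capturing the relevance structure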
that said, i’d treat the results with a grain of salt given there’s no peer review yet, and the benchmarks are only on a 30M-parameter model so far
Yup, the key phrase here is “under the right conditions”.
This may work well for their use case but fail badly in others; without peer review and broader testing there’s no way to know.