Comment by santiagobasulto
3 hours ago
Not at all, I had the same feeling as you the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory efficient. A single matmul instead of a full ViT encoder is probably a huge win.
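
To make the "single matmul" point concrete, here's a rough NumPy sketch of what a linear-projection "encoder" could look like. Everything in it is my assumption, not taken from the paper: the names `patchify` and `linear_encode`, the 16-pixel patch size, the 64x64 image, and the 64-dim embedding are just for illustration.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def linear_encode(image, weight, patch):
    """Hypothetical 'encoder': one matrix multiply per image, no attention layers."""
    return patchify(image, patch) @ weight

# Illustrative sizes only (assumed, not from the paper).
rng = np.random.default_rng(0)
patch, channels, d_model = 16, 3, 64
image = rng.standard_normal((64, 64, channels))
weight = rng.standard_normal((patch * patch * channels, d_model))

tokens = linear_encode(image, weight, patch)
print(tokens.shape)  # (16, 64)
```

A ViT encoder would start with roughly this same projection as its patch embedding and then run a stack of attention and MLP blocks on top of it, so dropping those blocks is where the savings would come from.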