Comment by p1esk
3 days ago
No. Each projection is ~5% of total FLOPs/params, which is not enough of a change in model capacity to matter. From what I remember, removing one of them was worse than removing either of the other two; I think it was Q. But in all three cases, the degradation (in both loss and perplexity) was significant.
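For a rough sanity check of the ~5% figure, here is a minimal sketch that counts weights only. Everything in it is an assumption rather than something from the comment: the function name `projection_param_share` is hypothetical, the config is GPT-2-small-like (d_model=768, 12 layers, vocab 50257, 1024 positions), the MLP is 4x wide, embeddings are tied, and biases and layer norms are ignored. It covers parameters only; the FLOPs share also depends on sequence length.

```python
def projection_param_share(d_model, n_layers, vocab_size, n_positions, mlp_ratio=4):
    """Fraction of total weights held by ONE attention projection (e.g. Q),
    summed over all layers. Weights only: no biases, no layer norms."""
    one_projection = d_model * d_model             # a single Q, K, or V matrix
    attention = 4 * one_projection                 # Q, K, V and the output projection
    mlp = 2 * mlp_ratio * d_model * d_model        # up-projection + down-projection
    per_layer = attention + mlp
    embeddings = (vocab_size + n_positions) * d_model  # tied input/output embeddings
    total = n_layers * per_layer + embeddings
    return n_layers * one_projection / total


# Assumed GPT-2-small-like configuration (not taken from the comment).
share = projection_param_share(d_model=768, n_layers=12, vocab_size=50257, n_positions=1024)
print(f"one projection: {share:.1%} of weights")
```

Under these assumptions a single projection comes out at a bit under 6% of the weights. Counting only the transformer blocks (no embeddings), it would be 1/12, so the share grows as the model gets larger relative to its vocabulary.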