Comment by lostmsu
2 days ago
> and in all three cases I observed quality degradation (when training from scratch).
At the same model size and training FLOPS?
2 days ago
> At the same model size and training FLOPS?
No. Each projection is ~5% of total FLOPs/params, so removing one is not enough of a capacity change to matter. From what I remember, removing one of them was worse than the other two — I think it was Q — but in all three cases the degradation (in both loss and perplexity) was significant.
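For readers unfamiliar with the ablation being discussed: "removing a projection" here means replacing one of the learned Q/K/V weight matrices in self-attention with the identity, so the attention mechanism consumes the raw input in that role. A minimal NumPy sketch (my own illustration, not the commenter's exact setup; single head, no masking or output projection):

```python
import numpy as np

def attention(x, remove=None, seed=0):
    """Single-head self-attention over x of shape (seq_len, d_model).
    If `remove` is one of "q", "k", "v", that learned projection is
    replaced by the identity -- the ablation under discussion."""
    d_model = x.shape[-1]
    rng = np.random.default_rng(seed)
    W = {p: rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
         for p in ("q", "k", "v")}
    if remove is not None:
        W[remove] = np.eye(d_model)  # drop the learned projection
    q, k, v = x @ W["q"], x @ W["k"], x @ W["v"]
    scores = q @ k.T / np.sqrt(d_model)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```

Each of the three matrices is d_model × d_model, so dropping one removes the same parameter count regardless of which role it plays; the observed asymmetry (Q hurting most) is therefore about function, not capacity.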