Comment by lostmsu
2 days ago
> and in all three cases I observed quality degradation (when training from scratch).
At the same model size and training FLOPS?
2 days ago
> At the same model size and training FLOPS?
No. Each projection is ~5% of total FLOPs/params, so removing one is not enough of a capacity change to matter. From what I remember, removing one of them was worse than the other two — I think it was Q — but in all three cases the degradation (in both loss and perplexity) was significant.
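For readers unfamiliar with the ablation being discussed: "removing a projection" here means replacing one of the learned Q/K/V weight matrices in self-attention with the identity, so the attention mechanism consumes the raw input in that role. A minimal NumPy sketch (my own illustration, not the commenter's exact setup; single head, no masking or output projection):

```python
import numpy as np

def attention(x, remove=None, seed=0):
    """Single-head self-attention over x of shape (seq_len, d_model).
    If `remove` is one of "q", "k", "v", that learned projection is
    replaced by the identity -- the ablation under discussion."""
    d_model = x.shape[-1]
    rng = np.random.default_rng(seed)
    W = {p: rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
         for p in ("q", "k", "v")}
    if remove is not None:
        W[remove] = np.eye(d_model)  # drop the learned projection
    q, k, v = x @ W["q"], x @ W["k"], x @ W["v"]
    scores = q @ k.T / np.sqrt(d_model)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```

Each of the three matrices is d_model × d_model, so dropping one removes the same parameter count regardless of which role it plays; the observed asymmetry (Q hurting most) is therefore about function, not capacity.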