Comment by ACCount37
3 months ago
It's kind of a shortcut answer by now. Especially for anything that touches pretraining.
"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
The answer is: check the paper. On page 12, in a throwaway line, it says they used 3 times as much compute for the new method as for the controls. And the gain was +4%.
A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
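To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The function name and the "gain per extra unit of compute" metric are my own illustration, not something from any particular paper, and the inputs are just the 3x and +4% figures from the example above. It also says nothing about what the controls would have scored if they had been given the same 3x budget, which is the comparison you would actually want.

```python
def gain_per_extra_compute(relative_gain: float, compute_multiplier: float) -> float:
    """Relative gain bought per extra baseline-sized unit of compute.

    Hypothetical metric for illustration only: it spreads the reported
    gain evenly over the compute spent beyond the baseline's budget.
    """
    if compute_multiplier <= 1.0:
        raise ValueError("compute_multiplier must be greater than 1.0")
    return relative_gain / (compute_multiplier - 1.0)


# Figures from the example: +4% gain at 3x the compute of the controls.
per_unit = gain_per_extra_compute(relative_gain=0.04, compute_multiplier=3.0)
print(f"{per_unit:.1%} gain per extra baseline-sized run")
```

So under these assumptions you pay two extra baseline-sized runs for the +4%, about 2% each. Whether that beats the other things competing for the same GPU-hours is the real question.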
Thanks.
Also, saying it needs 20x compute is exactly that. It's something we could do eventually, but not now.