Comment by llm_trw

17 days ago

To solve MNIST without mathematical tricks like convolutions or attention heads you would need 2.5e42 weights. Assuming you're using 16-bit weights, that's 5e42 bytes. A yottabyte is 1e24 bytes.

That is, you'd need 5 exa-yottabytes (5e18 yottabytes) to solve it.

Currently the whole world has around 200 zettabytes of storage.

In short, for the next 120 years MNIST will need mathematical tricks to be solved.
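For anyone who wants to check the arithmetic, here is a rough sketch in Python. The weight count and the world storage figure are taken as given from this comment. The storage doubling period is not stated anywhere above, so `years_until_enough` is a hypothetical helper that takes it as a parameter; the two-year value in the last line is only an assumption for illustration.

```python
import math

WEIGHTS = 2.5e42                  # weight count claimed above, taken as given
BYTES_PER_WEIGHT = 2              # 16-bit weights
YOTTABYTE = 1e24                  # bytes
ZETTABYTE = 1e21                  # bytes
WORLD_STORAGE = 200 * ZETTABYTE   # rough world storage figure from above

# Total storage needed for the weights alone.
total_bytes = WEIGHTS * BYTES_PER_WEIGHT
yottabytes = total_bytes / YOTTABYTE

# How many times larger the requirement is than all storage in the world,
# and how many doublings of world storage it would take to close that gap.
shortfall = total_bytes / WORLD_STORAGE
doublings = math.log2(shortfall)


def years_until_enough(doubling_period_years):
    """Years until world storage catches up, for an ASSUMED doubling period."""
    return doublings * doubling_period_years


print(f"bytes needed:      {total_bytes:.1e}")
print(f"yottabytes needed: {yottabytes:.1e}")
print(f"shortfall factor:  {shortfall:.1e}")
print(f"doublings needed:  {doublings:.1f}")
# Assumption: storage doubles every 2 years (not a figure from the comment).
print(f"years at 2y/doubling: {years_until_enough(2.0):.0f}")
```

The gap works out to about 64 doublings of world storage, so a figure on the order of 120 years corresponds to storage doubling a bit faster than once every two years.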

The distinction that I think is important to make when talking about "the bitter lesson" is that improving compute, training infrastructure, and tricks in the abstract wins over intelligent model and system design.

It's more that information about the specific problem you are solving has less impact than techniques that target the compute. So in this case, breaking down how to parse a PDF in stages for your domain involves specific expert knowledge of the domain, but training with attention is about efficient use of compute in general, with no domain expertise.