Comment by hn_acc1

19 days ago

My intuition: the value of a cell can represent the result of many time units of computation. If you cannot store enough intermediate results, you may end up recalculating the same / similar results over and over - at least in some algorithms. So one cell can stand for the results of hundreds of time units, and being able to store / load that one value for later re-use then replaces those same hundreds of time units. In effect, space can be used for "time compression" (like a compressed file) when the time is spent computing similar values multiple times.
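
A minimal sketch of that intuition in Python (the function and its workload are made up for illustration): once the result is cached, a later lookup of one stored value stands in for the hundreds of steps that originally produced it.

    import functools

    @functools.lru_cache(maxsize=None)
    def expensive(n):
        # stand-in for a computation worth "hundreds of time units"
        total = 0
        for i in range(n * 1000):
            total += i * i
        return total

    expensive(50)  # pays the full compute cost once
    expensive(50)  # now a single cached lookup: one stored cell replaces all that work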

If intermediate results are entirely uncorrelated, with no overlap in the work at all, that would not hold - space will not help you. Edit: this kind of problem is very rare - think of a cache with a 0 percent hit rate, which almost never happens.

And you can't really do it the other way around (at least not with current computing concepts): you cannot use a single unit of time as a stand-in / proxy for hundreds of cells, since we don't quite have infinitely wide SIMD architectures.

There are many algorithms with a space vs. time tradeoff. But what works best depends a lot on the time/energy cost of re-computing something vs. the storage/bandwidth cost of caching results.

Expensive calculation, cheap storage → caching results helps.

Limited bandwidth / 'expensive' storage, simple calculation (see: today's hyper-fast CPU + L1 cache combos) → better to re-compute some things on the fly as needed (the contrast is sketched below).
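
A rough, hypothetical illustration of the two paths in Python (table size and workload are invented; which one wins depends entirely on the hardware):

    import math

    # Path 1: burn storage to save cycles - precompute a table of results
    SIN_TABLE = [math.sin(2 * math.pi * i / 4096) for i in range(4096)]

    def sin_lookup(i):
        return SIN_TABLE[i % 4096]       # one memory access per call

    # Path 2: burn cycles to save storage - recompute on the fly
    def sin_compute(i):
        return math.sin(2 * math.pi * (i % 4096) / 4096)

    # On a hyper-fast CPU the "cheap" table lookup can lose to recomputation
    # as soon as the table falls out of L1/L2 and every access misses cache.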

I suspect there's a lot of existing software (and components) out there designed along the "save CPU cycles, burn storage" path, where in modern reality a "save storage, CPU cycles are cheap" approach would be more effective. CPU speeds have grown far faster than main memory bandwidth (or even size?) over the last decades.

For a datacenter, supercomputer, embedded system, PC, or some end-user's phone, the metrics will be different, but the same principle applies.

As I understand it, this is the essential insight behind dynamic programming algorithms: the whole idea is to exploit the redundancies in a recursive task by memoizing the partial results of lower-order stages.
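
The classic textbook example of that (nothing specific to the article), in Python: naive recursion redoes the same overlapping subproblems over and over, while a dynamic programming table computes each partial result exactly once.

    def fib_naive(n):
        # recomputes the same subproblems exponentially many times
        return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

    def fib_dp(n):
        # bottom-up dynamic programming: each partial result stored once,
        # O(n) time and O(n) space for the table
        table = [0, 1] + [0] * max(0, n - 1)
        for i in range(2, n + 1):
            table[i] = table[i - 1] + table[i - 2]
        return table[n]

And the tradeoff runs the other way too: keeping only the last two entries instead of the whole table brings the space back down to O(1) without changing the time.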

I think this theorem applies well to modern LLMs: a large language model with pre-computed weights can be used to run very complex computations that approximate human knowledge, which otherwise would have been impossible or would have required many orders of magnitude more compute.