Comment by benlivengood

5 years ago

I wonder if a lot of these algorithms/libraries are a decade or more old, and modern CPUs and RAM have since caught up with problems that used to be intractable on a single machine; furthermore, the libraries were optimized for older generations of clusters with different interconnects. Modern CPU packages incorporate a lot of features from the world-class supercomputers of a couple of decades ago.

In theoretical CS classes there were discussions of the tradeoffs between networked, NUMA, and other fabrics. Analysis of what actually ran fastest, beyond Big O notation, was only touched on briefly, but there is a definite advantage to making problems tractable that otherwise wouldn't be. At the FAANGs it was mostly embarrassingly parallel algorithms with a thin skin of distributed computing for synchronization/coordination, so the focus had always been on absolute speed or efficiency.
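
To make the "skin of distributed computing" concrete, here's a minimal single-machine sketch in Python (the function names are mine, purely for illustration): the workers never communicate with each other, and the only coordination is a final merge step.

    # The heavy lifting is embarrassingly parallel; the "distributed"
    # part is just a thin reduce at the end. Hypothetical example,
    # not from any particular system or library.
    from multiprocessing import Pool

    def process_shard(shard):
        # Pure function of one shard; no cross-worker communication.
        return sum(x * x for x in shard)

    def run(shards):
        with Pool() as pool:
            partials = pool.map(process_shard, shards)  # parallel map
        return sum(partials)  # the "skin": one coordination step

    if __name__ == "__main__":
        shards = [range(i * 1000, (i + 1) * 1000) for i in range(8)]
        print(run(shards))

The same shape scales out by swapping Pool for an RPC or MapReduce layer; the algorithm itself doesn't change, which is why raw per-node speed ends up being what matters.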