Comment by srcreigh

1 month ago

MapReduce is from a world with slow HDDs, expensive RAM, expensive enterprise-class servers, and fast networks.

In that world, to get the best performance, you’d have to shard your data across a cluster and use MapReduce.

Even in the author's 2014 world of SSDs and multi-core consumer PCs, their aggregate pipeline would be around 2x faster if the work was split across two equivalent machines.

The limit on how much faster distributed computing can be comes down to latency more than throughput. I wouldn't be surprised if this aggregate query could run in 10ms on pre-sharded data in a distributed cluster.
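
A rough sketch of that latency-versus-throughput argument, as a toy model. The function name, the data size, the per-node scan rate, and the coordination latency are all made-up assumptions for illustration, not measurements from the article or from any real cluster. The model also assumes the data is already sharded evenly and that coordination cost does not grow with node count.

```python
def estimated_runtime_s(data_bytes, per_node_bytes_per_s, nodes, coordination_latency_s):
    """Toy model: every node scans an equal shard in parallel,
    then a fixed coordination latency is paid once."""
    scan_time_s = data_bytes / (per_node_bytes_per_s * nodes)
    return coordination_latency_s + scan_time_s


# Hypothetical inputs (assumptions, not benchmarks).
DATA_BYTES = 2 * 10**9             # 2 GB of input
PER_NODE_BYTES_PER_S = 500 * 10**6  # 500 MB/s scan rate per node
LATENCY_S = 0.005                   # 5 ms to fan out and gather results

if __name__ == "__main__":
    for nodes in (1, 2, 100, 1000):
        # A single machine pays no cluster coordination latency in this model.
        latency = 0.0 if nodes == 1 else LATENCY_S
        runtime = estimated_runtime_s(DATA_BYTES, PER_NODE_BYTES_PER_S, nodes, latency)
        print(f"{nodes:>5} nodes: {runtime:.4f} s")
```

Under these assumptions, going from one machine to two roughly halves the runtime, and adding more nodes pushes the total toward the fixed latency term, which is the floor the model cannot get below.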

2 comments


dapperdrake 1 month ago

Confusing the concept and the implementation.

srcreigh 1 month ago

Somebody has to go back to first principles. I wrote Pig scripts in 2014 in Palo Alto. Yes, it was shit. IYKYK. But the author, and nearly everybody in this thread, are wrong to generalize.
PCIe would have to be millions of times faster than Ethernet before command-line tools are actually faster than distributed computing, and I don't see that happening any time soon.