Comment by bdndndndbve

6 months ago

Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations, which used to be a point of comparison to MapReduce specifically. Not ground-breaking nowadays but when I was doing this stuff 10+ years ago we didn't have all the open-source horizontally scalable SQL databases you get now - Oracle could do it and RedShift was new hotness.

1 comment

bdndndndbve

ignoreusernames 6 months ago

> Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations

I see your point, but that's only true within a single stage. Any operator that requires partitioning (groupBys and joins for example) requires writing to disk

> [...] which used to be a point of comparison to MapReduce specifically.

So each mapper in hadoop wrote partial results to disk? LOL this was way worse than I remember than. It's been a long time that I've dealt with hadoop

> Not ground-breaking nowadays but when I was doing this stuff 10+ years

I would say that it wouldn't be ground breaking 20 years ago. I feel like hadoop influence held up our entire field for years. Most of the stuff that arrow made mainstream and is being used by a bunch of engines mentioned in this thread has been known for a long time. It's like, as a community, we had blindfolds on. Sorry about the rant, but I'm glad the hadoop fog is finally dissipating