Comment by ignoreusernames
3 days ago
Just out of curiosity, why do you say that Spark is "in-memory"? I see a lot of people claiming that, including several that I've interviewed in the past few years, but that's not very accurate (at least in the default case). Spark SQL execution uses a bog-standard volcano-ish iterator model (with a pretty shitty codegen operator-merging part) built on top of their RDD engine. The exchange (shuffle) is disk-based by default (both for SQL queries and lower-level RDD code), so unless you mount the shuffle directory on a ramdisk, I would say that Spark is disk-based. You can try it out in spark shell:
spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
spark.read.parquet("sample_data").groupBy($"col").count().count()
After running the code, you should see a /tmp/blockmgr-{uuid} directory (under spark.local.dir, which defaults to /tmp) that holds the exchange data.
Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations, which used to be a point of comparison to MapReduce specifically. Not ground-breaking nowadays, but when I was doing this stuff 10+ years ago we didn't have all the open-source horizontally scalable SQL databases you get now. Oracle could do it, and Redshift was the new hotness.
> Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations
I see your point, but that's only true within a single stage. Any operator that requires repartitioning the data (groupBys and joins, for example) requires writing to disk.
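A quick way to see where that stage boundary falls is to look at the physical plan. This is just a sketch for spark shell, assuming the sample_data directory written by the snippet earlier in the thread; the exact plan text varies by Spark version:

```scala
// spark shell: `spark` and the $"..." column syntax are already in scope
val agg = spark.read.parquet("sample_data").groupBy($"col").count()

// Print the physical plan; the Exchange node is the shuffle boundary between stages
agg.explain()

// Same plan as a string, in case you want to check for the shuffle programmatically
val plan = agg.queryExecution.executedPlan.toString
println(plan.contains("Exchange"))
```

Everything below the Exchange node runs pipelined within one stage; the Exchange itself is where the shuffle files get written.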
> [...] which used to be a point of comparison to MapReduce specifically.
So each mapper in Hadoop wrote partial results to disk? LOL, this was way worse than I remember, then. It's been a long time since I've dealt with Hadoop.
> Not ground-breaking nowadays but when I was doing this stuff 10+ years
I would say that it wouldn't have been groundbreaking 20 years ago. I feel like Hadoop's influence held up our entire field for years. Most of the stuff that Arrow made mainstream, and that is being used by a bunch of engines mentioned in this thread, has been known for a long time. It's like, as a community, we had blindfolds on. Sorry about the rant, but I'm glad the Hadoop fog is finally dissipating.
Because that was the central point in the original whitepaper [1]: Hadoop is slow because it's disk-only, whereas Spark uses memory and caching to speed things up. I understand Spark isn't 100% in-memory the way, say, Redis is, but it was still the major selling point vs. Hadoop.
[1] https://people.csail.mit.edu/matei/papers/2010/hotcloud_spar...
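For what it's worth, the caching the paper is selling looks roughly like this in today's API. A sketch for spark shell, reusing the sample_data from upthread; the MEMORY_ONLY storage level is just my pick for illustration:

```scala
import org.apache.spark.storage.StorageLevel

// Mark the dataset for in-memory caching (lazy: nothing is cached yet)
val df = spark.read.parquet("sample_data").persist(StorageLevel.MEMORY_ONLY)

// The first action materializes the cached partitions
df.count()

// Later queries read the input from the cache instead of re-reading parquet
df.groupBy($"col").count().count()

// Release the cached partitions when done
df.unpersist()
```

Note that only the input scan is served from memory here; the groupBy still shuffles through local disk, which is the other commenter's point.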