
Comment by jitl

3 days ago

Spark, for sure, I view with suspicion and avoid as much as possible at work.

SQL, though, is going the distance. Feldera, for example, is SQL-based stream processing and uses DataFusion under the hood for some data wrangling. DuckDB is also very much built around SQL.

I have my quibbles with SQL as a language, but I would prefer SQL embedded in $myLanguage to needing to use Python or (shudder) Scala to screw around with data.
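
For a concrete picture of "SQL embedded in $myLanguage", here is a rough sketch using DuckDB's JDBC driver from Scala; the only assumptions are that the duckdb_jdbc artifact is on the classpath and that the empty "jdbc:duckdb:" URL opens an in-memory database:

    import java.sql.DriverManager

    // open an in-memory DuckDB database over plain JDBC
    val conn = DriverManager.getConnection("jdbc:duckdb:")
    val stmt = conn.createStatement()

    // the query is just a SQL string; results come back as a normal ResultSet
    val rs = stmt.executeQuery("SELECT 21 * 2 AS answer")
    while (rs.next()) println(rs.getInt("answer"))

    conn.close()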

Absolutely agree. Spark is the same garbage as Hadoop but in-memory.

  • Just out of curiosity, why do you say that Spark is "in-memory"? I see a lot of people claiming that, including several I've interviewed in the past few years, but it's not very accurate (at least in the default case). Spark SQL execution uses a bog-standard volcano-ish iterator model (with a pretty shitty codegen operator-merging part) built on top of their RDD engine. The exchange (shuffle) is disk-based by default, both for SQL queries and lower-level RDD code; unless you mount the shuffle directory on a ramdisk, I would say Spark is disk-based. You can try it out in spark-shell:

      spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
      spark.read.parquet("sample_data").groupBy($"col").count().count()
    

    After running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.
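
    And if you actually want the exchange to be memory-backed, the knob I know of is spark.local.dir: point it at a tmpfs mount and the blockmgr directory lands in RAM. Rough sketch, local mode only (cluster managers usually override this setting, and /dev/shm/spark-local is just an example path):

      import org.apache.spark.sql.SparkSession

      // spark.local.dir decides where shuffle/spill files go; it has to be set
      // before the SparkContext starts, hence on the builder
      val spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.local.dir", "/dev/shm/spark-local")
        .getOrCreate()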

    • Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations, which used to be a point of comparison to MapReduce specifically. Not ground-breaking nowadays but when I was doing this stuff 10+ years ago we didn't have all the open-source horizontally scalable SQL databases you get now - Oracle could do it and RedShift was new hotness.
