Comment by jitl

3 days ago

Spark for sure I view with suspicion and avoid as much as possible at work.

SQL though is going the distance. Feldera, for example, is SQL-based stream processing and uses DataFusion under the hood for some data wrangling. DuckDB is also very SQL.

I have my quibbles with SQL as a language but I would prefer SQL embedded in $myLanguage to needing to use Python or (shudder) Scala to screw around with data.

Absolutely agree. Spark is the same garbage as Hadoop but in-memory.

  • Just out of curiosity, why do you say that Spark is "in-memory"? I see a lot of people claiming that, including several that I've interviewed in the past few years, but that's not very accurate (at least in the default case). Spark SQL execution uses a bog-standard volcano-ish iterator model (with a pretty shitty codegen operator-merging part) built on top of their RDD engine. The exchange (shuffle) is disk-based by default (both for SQL queries and lower-level RDD code), so unless you mount the shuffle directory on a ramdisk, I would say that Spark is disk-based. You can try it out in the Spark shell:

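      // Write 10,001 rows (0..10000) to Parquet; explode() names its output column "col".
      spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
      // groupBy forces an exchange (shuffle), which writes shuffle files to local disk.
      spark.read.parquet("sample_data").groupBy($"col").count().count()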
    

    After running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.
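
    If you'd rather not dig through /tmp by hand, something like the rough sketch below can be pasted into the same shell to list those directories and how much data sits in them. It assumes spark.local.dir is left at its default of /tmp; the helper names blockManagerDirs and filesUnder are made up for illustration and are not part of Spark.

    ```scala
    import java.io.File

    // Assumption: spark.local.dir is left at its default, so scratch space lives in /tmp.
    val localDir = new File("/tmp")

    // Hypothetical helper: directories named like Spark's block manager scratch space.
    def blockManagerDirs(root: File): Seq[File] = {
      val children = Option(root.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
      children.filter(f => f.isDirectory && f.getName.startsWith("blockmgr-"))
    }

    // Hypothetical helper: all regular files below a directory, via simple recursion.
    def filesUnder(dir: File): Seq[File] = {
      val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
      children.filter(_.isFile) ++ children.filter(_.isDirectory).flatMap(filesUnder)
    }

    // Print each scratch directory with its file count and total size on disk.
    blockManagerDirs(localDir).foreach { dir =>
      val files = filesUnder(dir)
      println(s"${dir.getPath}: ${files.size} files, ${files.map(_.length).sum} bytes")
    }
    ```

    Run it while the shell is still open, since the scratch directories are normally cleaned up when the application exits.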

    • Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations, which used to be a point of comparison to MapReduce specifically. Not ground-breaking nowadays, but when I was doing this stuff 10+ years ago we didn't have all the open-source horizontally scalable SQL databases you get now - Oracle could do it and Redshift was the new hotness.
