
Comment by jamesblonde

3 days ago

There is a Cambrian explosion of data processing engines: DataFusion, Polars, DuckDB, Feldera, Pathway, and more than I can remember.

It reminds me of 15 years ago, when JDBC/ODBC was the interface for data. Then, as data volumes increased, specialized databases became viable - graph, document, JSON, key-value, etc.

I don't see the SQL and Spark hammers keeping their ETL monopolies for much longer.

Spark in particular I view with suspicion and avoid as much as possible at work.

SQL, though, is going the distance. Feldera, for example, is SQL-based stream processing and uses DataFusion under the hood for some data wrangling. DuckDB is also very SQL.

I have my quibbles with SQL as a language, but I would prefer SQL embedded in $myLanguage to having to use Python or (shudder) Scala to screw around with data.
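
For a rough idea of what that embedding can look like, here is a minimal sketch using DuckDB's JDBC driver from Scala (assuming the org.duckdb:duckdb_jdbc artifact is on the classpath; the query itself is just illustrative):

    import java.sql.DriverManager

    // In-process DuckDB: no server, the whole engine is a library call away.
    // "jdbc:duckdb:" with no path opens a transient in-memory database.
    val conn = DriverManager.getConnection("jdbc:duckdb:")
    val stmt = conn.createStatement()
    // SQL embedded in the host language; rows come back as ordinary JDBC results.
    val rs = stmt.executeQuery("SELECT range AS n, range * range AS square FROM range(5)")
    while (rs.next()) println(s"n=${rs.getLong("n")}, square=${rs.getLong("square")}")
    conn.close()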

  • Absolutely agree. Spark is the same garbage as Hadoop but in-memory.

    • Just out of curiosity, why do you say that Spark is "in-memory"? I see a lot of people claiming that, including several I've interviewed in the past few years, but it's not very accurate (at least in the default case). Spark SQL execution uses a bog-standard volcano-ish iterator model (with a pretty shitty codegen operator-merging part; see the toy sketch at the end of this comment) built on top of their RDD engine. The exchange (shuffle) is disk-based by default, both for SQL queries and lower-level RDD code. Unless you mount the shuffle directory on a ramdisk, I would say that Spark is disk-based. You can try it out in the Spark shell:

        // write a single-column dataset (the exploded column is named "col")
        spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
        // groupBy forces an exchange (shuffle); the final count() action runs the job
        spark.read.parquet("sample_data").groupBy($"col").count().count()
      

      After running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.
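
      To make the "volcano-ish iterator model" concrete, here is a toy sketch of the idea (illustrative only, not Spark's actual classes): each operator pulls one row at a time from its child via next(), and the root drives the whole pipeline.

        // Toy volcano/iterator model -- illustrative, not Spark's real operator classes.
        case class Row(values: Seq[Any])

        trait Operator { def next(): Option[Row] }

        // Leaf operator: emits rows from an in-memory collection.
        class Scan(rows: Seq[Row]) extends Operator {
          private val it = rows.iterator
          def next(): Option[Row] = if (it.hasNext) Some(it.next()) else None
        }

        // Filter pulls from its child until a row passes the predicate.
        class Filter(child: Operator, pred: Row => Boolean) extends Operator {
          def next(): Option[Row] = {
            var row = child.next()
            while (row.exists(r => !pred(r))) row = child.next()
            row
          }
        }

        // Keep only even values; calling next() at the root pulls rows up the chain.
        val plan = new Filter(
          new Scan((1 to 5).map(i => Row(Seq(i)))),
          r => r.values.head.asInstanceOf[Int] % 2 == 0)
        Iterator.continually(plan.next()).takeWhile(_.isDefined).flatten.foreach(println)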


I don't think SQL is going anywhere. There might be abstractions that use these engines, but you'll still write SQL (à la dbt) long before people get used to ten different APIs for the same thing.

What Spark has going for it is its ecosystem. Things like Delta and Iceberg are being written for Spark first. Look at PyIceberg, for example.