Comment by jamesblonde
3 days ago
There is a Cambrian explosion in data processing engines: DataFusion, Polars, DuckDB, Feldera, Pathway, and more than I can remember.
It reminds me of 15 years ago, when there was JDBC/ODBC for data. Then, when data volumes increased, specialized databases became viable - graph, document, JSON, key-value, etc.
I don't see the SQL and Spark hammers keeping their ETL monopolies for much longer.
Spark in particular I view with suspicion and avoid as much as possible at work.
SQL, though, is going the distance. Feldera, for example, is SQL-based stream processing and uses DataFusion under the hood for some data wrangling. DuckDB is also very SQL.
I have my quibbles with SQL as a language, but I would prefer SQL embedded in $myLanguage to needing to use Python or (shudder) Scala to screw around with data.
Absolutely agree. Spark is the same garbage as Hadoop but in-memory.
Just out of curiosity, why do you say that Spark is "in-memory"? I see a lot of people claiming that, including several I've interviewed in the past few years, but it's not very accurate (at least in the default case). Spark SQL execution uses a bog-standard Volcano-ish iterator model (with a pretty shitty codegen operator-merging part) built on top of their RDD engine. The exchange (shuffle) is disk-based by default, both for SQL queries and lower-level RDD code; unless you mount the shuffle directory on a ramdisk, I would say that Spark is disk-based. You can try it out in the spark shell:
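Something like this (a minimal sketch; any job that forces a shuffle will do, the exact numbers don't matter):

    // spark-shell predefines `spark` as the SparkSession
    val df = spark.range(0L, 10000000L)

    // repartition forces an exchange (shuffle) between stages
    df.repartition(200).count()

    // By default the block manager writes the shuffle blocks to local
    // disk under spark.local.dir (usually /tmp), not to memory.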
After running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.
Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful Apache DataFusion query engine: https://datafusion.apache.org/comet/user-guide/overview.html
As someone who comfortably ignored the NoSQL hype, I am not worried.
I don't think SQL is going anywhere. There might be abstractions that use these engines, but you'll write SQL (à la dbt) before people get used to 10 APIs for the same thing.
What Spark has going for it is its ecosystem. Things like Delta and Iceberg are being written for Spark first. Look at PyIceberg, for example.
Did you really put SQL and Spark in the same basket?