Comment by thebuilderjr
6 days ago
DataFusion and Polars are like two sides of the same Rust coin: DataFusion is built for distributed, SQL-based analytics at scale, serving as the backbone for data systems and enabling complex query execution across clusters. Polars, on the other hand, is laser-focused on blazing-fast, single-node data manipulation, offering a Python-like DataFrame API that feels intuitive for exploratory analysis and in-memory processing.
And the thing is - single node can still scale ridiculously high without the orchestration overheads of distributed stuff.
You can do dual AMD 192 core CPU's (384 cores / 768 threads) with 9 TB of memory and a 24 disk SSD array in a 2U box.
The vast majority of businesses will never need more than single node architecture. Hardware advances are continually increasing that percentage.
SPARK and its modern counterpart Databricks are essentially obsolete for these organizations. Whatever justification they may have had in the past is no longer true.
I’ve recently closed down several in house SPARK clusters and replaced them with single nodes.
In addition to the simplicity of the design and reduction in cost there was a massive increase in performance. I expect this will become more common in the future; leaving distributed architecture for a small and increasingly niche group.
Exactly, datafusion is implied batteries included apache bigdata ecosystem. Polars is chasing the Python Pandas crowd and uses python syntax, handy if you're already comfortable with ipython.
Can't you use DataFusion single node/without any Apache ecosystem stuff? They have a Python library and DataFusion is "just" a query engine. (If anything, I'd call Pandas the batteries included option...)
I think the difference is more that DataFusion is built as a library so you can plug it into the product you're building (e.g. Comet, which plugs it into Spark, or pg_lakehouse, which plugs it into Postgres). Polars could be used that way, but it's also a functional package you can pip install and use as a Pandas alternative right now.
"pg_analytics (formerly named pg_lakehouse) puts DuckDB inside Postgres" https://github.com/paradedb/pg_analytics
2 replies →