Comment by theLiminator

2 days ago

I've done some testing of polars, duckdb, and datafusion.

Anecdotally, these are my experiences:

DuckDB (last used maybe 7-8 months ago):

- Very nice for very fast local queries (against parquet files; I ignored their homegrown file format)

- Most pleasant cli

- Seems to have the best out of core experience

- As far as I can tell, seems to be closest to state of the art in terms of algorithms/overall design, though honestly everyone is within spitting distance of each other

- Spark api seems exciting

Datafusion (last used 1.5y ago):

- Most pleasant to build/extend on top of (in rust)

- Is to OLAP DBMSs what LLVM is to compilers (stole this quote from Andrew Lamb)

- Could be wrong, but in terms of core engineering discipline they are the most rigorous/thoughtful (no shade thrown to the other libraries, which are all awesome libraries/tools too)

- Seems to be the most foundational to many other tools (and is most ubiquitously embedded)

- Their python dataframe centric workflow isn't as nice as polars (this is rapidly improving afaict)

- Docs are lagging behind polars

- Very exciting future (ray datafusion, improvements to python bindings, ballista, datafusion-comet)

Polars (last used this week):

- The most pleasant api by far for a programmatic user

- Pretty good interop with python ecosystem

- Rust crate is a second class citizen

- Python is a first class citizen

- Probably the best for advanced ETL use cases

- Fastest library for querying hive partitioned parquet data in an object store

- Wide end-user adoption (less so as a query engine)

- Moves very fast (I do get more bugs/regressions in polars version to version, but on the flip side, they move fast to fix issues and release very often)

- Exciting distributed cloud solution coming (is proprietary though)

- New streaming engine based off morsel driven parallelism (same architecture as duckdb afaict?) should greatly improve polars OOC capabilities

- Much nicer to test/compose/build re-usable queries/functions on top of than SQL based ETL tools

- Error messages/debuggability/observability are still immature
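
For anyone unfamiliar with the morsel driven parallelism mentioned in the list above, here is a toy sketch of the scheduling idea in plain Python. It is not how polars or duckdb implement it: the names (`MORSEL_SIZE`, `make_morsels`, `worker`, `parallel_even_sum`) are made up for illustration, and on a standard (GIL) CPython build the threads won't actually speed up pure-Python work. The point is only the pattern: the input is cut into small chunks, workers pull chunks from a shared queue as they become free, and per-worker partial results are merged at the end.

```python
import queue
import threading

MORSEL_SIZE = 1_000  # hypothetical chunk size, picked only for illustration


def make_morsels(rows, morsel_size=MORSEL_SIZE):
    """Split the input into small fixed-size chunks ("morsels")."""
    return [rows[i:i + morsel_size] for i in range(0, len(rows), morsel_size)]


def worker(morsels, partials, slot):
    """Pull morsels until the shared queue is empty, keeping a worker-local sum."""
    local_sum = 0
    while True:
        try:
            morsel = morsels.get_nowait()
        except queue.Empty:
            break
        # The per-morsel "pipeline": filter even values, then aggregate.
        local_sum += sum(v for v in morsel if v % 2 == 0)
    # Each worker writes only to its own slot, so no lock is needed here.
    partials[slot] = local_sum


def parallel_even_sum(rows, n_workers=4):
    """Sum the even values in rows using dynamically scheduled morsels."""
    morsels = queue.Queue()
    for m in make_morsels(rows):
        morsels.put(m)

    partials = [0] * n_workers
    threads = [
        threading.Thread(target=worker, args=(morsels, partials, i))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge the per-worker partial aggregates into the final result.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_even_sum(list(range(10_000))))
```

Because workers grab the next morsel whenever they finish one, a slow chunk doesn't stall the others, and only a handful of morsels need to be in flight at once, which is roughly why this style of engine tends to help with out-of-core workloads.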

All three are awesome tools. The OLAP space is really heating up.

Things I still see lacking in the OLAP end-user space are:

- Unified batch/streaming dataframe centric workflows: nothing is truly high throughput/low latency/pleasant to use/mature/robust. I've only really seen arroyo and risingwave, and neither seems too mature/usable yet.

- Nothing is quite at the robustness level of something like sqlite

- Despite native query engines, datalake implementations are mostly lagging behind their java equivalents (iceberg/delta)

Some questions for other users:

- I'm curious if anyone uses Ibis in prod. As an end user, I found that it wasn't very usable