Comment by netcraft

3 days ago

Why would this be useful over of DuckDb? (earnest question)

They’re similar, but DuckDb is more of a batteries-included database whereas DataFusion is an embeddable query engine. You can use DuckDb in embedded-ish scenarios, but it’s not primarily targeting that use case. To put it another way, DataFusion is sometimes described as “the LLVM of databases.”

Another difference is that DuckDb is written in C++ whereas DataFusion is in Rust, so all the usual memory-safety and performance arguments apply. In fact DataFusion has recently overtaken DuckDb in Clickbench results after a community push last year to optimize its performance.

I think you would pick DataFusion over DuckDB if you want to customize it substantially. Not just with user defined functions (which are quite easy to write in DataFusion and are very fast), but things like * custom file formats (e.g. Spiral or Lance) * custom query languages / sql dialects * custom catalogs (e.g. other than a local file or prebuilt duckdb connectors) * custom indexes (read only parts of parquet files based on extra information you store) * etc.

If you are looking for the nicest "run SQL on local files" experience, DuckDB is pretty hard to beat

Disclaimer: I am the PMC chair of DataFusion

There are some other interesting FAQs here too: https://datafusion.apache.org/user-guide/faq.html