Comment by netcraft

6 months ago

Why would this be useful over DuckDb? (earnest question)


netcraft

They’re similar, but DuckDb is more of a batteries-included database whereas DataFusion is an embeddable query engine. You can use DuckDb in embedded-ish scenarios, but it’s not primarily targeting that use case. To put it another way, DataFusion is sometimes described as “the LLVM of databases.”

Another difference is that DuckDb is written in C++ whereas DataFusion is in Rust, so all the usual memory-safety and performance arguments apply. In fact DataFusion has recently overtaken DuckDb in Clickbench results after a community push last year to optimize its performance.

jitl 6 months ago

We tried both about 8 months ago. At the time, DuckDB’s Node driver leaked memory and segfaulted, and DataFusion was missing some features we wanted. But they are both improving rapidly.

geysersam 6 months ago

> DataFusion has recently overtaken DuckDb in Clickbench results after a community push last year

Really? I don't see it near the top.

[CH benchmarks](https://benchmark.clickhouse.com/#eyjzexn0zw0ionsiqwxsb3leqi...)
- alamb 6 months ago
  
  Specifically, DataFusion is faster when querying Parquet files directly.
  Most of the ClickBench leaderboard is for database-specific file formats (which you first have to load the data into).
- kalendos 6 months ago
  
  You might need to adjust the filters to do an apples-to-apples comparison.
  https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQi...
  

alamb 6 months ago

I think you would pick DataFusion over DuckDB if you want to customize it substantially. Not just with user-defined functions (which are quite easy to write in DataFusion and are very fast), but things like:

* custom file formats (e.g. Spiral or Lance)
* custom query languages / SQL dialects
* custom catalogs (e.g. other than a local file or prebuilt DuckDB connectors)
* custom indexes (read only parts of Parquet files based on extra information you store)
* etc.
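To make the "custom indexes" item a bit more concrete, here is a minimal sketch in plain Python of the general idea: keep per-row-group min/max values in a side index and use them to skip row groups a filter cannot match. This is not DataFusion's API; `RowGroupStats`, `prune_row_groups`, and the sample index are hypothetical names and data made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical side-index entry: min/max of one column in one row group."""
    row_group: int
    min_value: int
    max_value: int


def prune_row_groups(index, lo, hi):
    """Return ids of row groups whose [min, max] range overlaps [lo, hi]."""
    return [
        s.row_group
        for s in index
        if s.max_value >= lo and s.min_value <= hi
    ]


# Hypothetical index for a file with three row groups.
index = [
    RowGroupStats(row_group=0, min_value=1, max_value=100),
    RowGroupStats(row_group=1, min_value=101, max_value=200),
    RowGroupStats(row_group=2, min_value=201, max_value=300),
]

# A filter like `WHERE id BETWEEN 150 AND 180` only needs row group 1.
selected = prune_row_groups(index, 150, 180)
print(selected)
```

An engine that lets you plug in this kind of lookup can avoid reading the skipped row groups at all, which is where the savings would come from.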

If you are looking for the nicest "run SQL on local files" experience, DuckDB is pretty hard to beat.

Disclaimer: I am the PMC chair of DataFusion

There are some other interesting FAQs here too: https://datafusion.apache.org/user-guide/faq.html