They’re similar, but DuckDB is more of a batteries-included database, whereas DataFusion is an embeddable query engine. You can use DuckDB in embedded-ish scenarios, but it’s not primarily targeting that use case. To put it another way, DataFusion is sometimes described as “the LLVM of databases.”
Another difference is that DuckDB is written in C++ whereas DataFusion is in Rust, so all the usual memory-safety and performance arguments apply. In fact, DataFusion has recently overtaken DuckDB in ClickBench results after a community push last year to optimize its performance.
We tried both about 8 months ago. At the time, DuckDB’s Node driver leaked memory and segfaulted, and DataFusion was missing some features we wanted. But both are improving rapidly.
> DataFusion has recently overtaken DuckDB in ClickBench results after a community push last year
Really? I don't see it near the top.
[CH benchmarks](https://benchmark.clickhouse.com/#eyjzexn0zw0ionsiqwxsb3leqi...)
Specifically, DataFusion is faster when querying Parquet directly.
Most of the ClickBench leaderboard is for database-specific file formats (that you first have to load the data into).
You might need to adjust the filters to do an apples-to-apples comparison.
https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQi...
I think you would pick DataFusion over DuckDB if you want to customize it substantially. Not just with user-defined functions (which are quite easy to write in DataFusion and are very fast), but things like
* custom file formats (e.g. Spiral or Lance)
* custom query languages / SQL dialects
* custom catalogs (e.g. other than a local file or prebuilt DuckDB connectors)
* custom indexes (read only parts of Parquet files based on extra information you store)
* etc.
If you are looking for the nicest "run SQL on local files" experience, DuckDB is pretty hard to beat.
Disclaimer: I am the PMC chair of DataFusion
There are some other interesting FAQs here too: https://datafusion.apache.org/user-guide/faq.html