Comment by chatmasta
3 days ago
They’re similar, but DuckDB is more of a batteries-included database whereas DataFusion is an embeddable query engine. You can use DuckDB in embedded-ish scenarios, but it’s not primarily targeting that use case. To put it another way, DataFusion is sometimes described as “the LLVM of databases.”
Another difference is that DuckDB is written in C++ whereas DataFusion is written in Rust, so all the usual memory-safety and performance arguments apply. In fact, DataFusion has recently overtaken DuckDB in ClickBench results after a community push last year to optimize its performance.
We tried both about 8 months ago. At the time, DuckDB’s Node driver leaked memory and segfaulted, and DataFusion was missing some features we wanted. But they are both improving rapidly.
> DataFusion has recently overtaken DuckDB in ClickBench results after a community push last year
Really? I don't see it near the top.
[CH benchmarks](https://benchmark.clickhouse.com/#eyjzexn0zw0ionsiqwxsb3leqi...)
Specifically, DataFusion is faster when querying Parquet directly.
Most of the ClickBench leaderboard covers database-specific file formats (which you first have to load the data into).
You might need to adjust the filters to get an apples-to-apples comparison.
https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQi...
It's not clear why someone would need to give up the native DuckDB format if it is much faster.