They are both columnar data stores, and while they solve the same problem, I wouldn't use them in the same situations. DuckDB is often referred to as the SQLite of analytics, meaning it's lightweight and embeddable. ClickHouse, on the other hand, is definitely the way to go if you need to distribute your queries over multiple servers.
If your workload fits on a single server and you only need standard SQL functions, both will serve you well. If you have more specific needs, have a look at the documentation. For example, ClickHouse has very extensive support for nested arrays, which can prove quite useful.
DuckDB has also gained mindshare as an engine for reading Parquet from data lakes. The fact that it's embeddable enables some very creative uses. It helped that, for a time, DuckDB was substantially quicker than ClickHouse at reading Parquet. That advantage has eroded with recent improvements to ClickHouse's Parquet support, and I expect the gap to close quickly.
Scale. DuckDB chokes at a certain point (just as SQLite isn't in the same league as MySQL or PostgreSQL in terms of scalability). That's why they're building a better/bigger version.
Different beasts, but if by any chance you love ClickHouse already and just want to run OLAP queries in-process, there's chdb: https://github.com/chdb-io/chdb
They solve the same problem in that they are both OLAP data stores, but that's where the similarity ends. ClickHouse is a centralised OLAP store (like tens of others), whilst DuckDB is an embedded database that usually runs in-process.
What is it about DuckDB and its strange cult-like following? It's nice that it's in-process, but then it's an incremental improvement over Pandas. A nice, well-implemented tool, but I don't see what's transformative about it.
Also, clickhouse-local exists: https://clickhouse.com/docs/en/operations/utilities/clickhou...
FWIW, you can check out clickbench.com, a benchmark that includes Parquet (partitioned) results for both ClickHouse and DuckDB.
ClickHouse's power is having one binary that runs anywhere:
- local
- server
- cloud (*)
- serverless
- in-process (https://github.com/chdb-io/chdb, similar to DuckDB)
(*) except for the forked cloud versions: ClickHouse Inc, Huawei, etc.