Comment by wenc
2 years ago
Spark is still pretty non-performant.
If the workload fits in memory on a single machine, DuckDB is so much more lightweight and faster.
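For a sense of what "lightweight" means here: DuckDB is an in-process library, so there's no cluster or server to stand up. A throwaway illustration (the query itself is meaningless, just showing the deployment story):

    # pip install duckdb -- that's the whole setup.
    import duckdb

    # Runs in-process; no daemon, no configuration.
    duckdb.sql("SELECT 42 AS answer").show()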
My current task at my day job is analyzing a large amount of data stored in a Spark cluster. I'd say, so far, 80% of the job has been extracting data from the cluster so that I can work with it interactively with DuckDB.
This data is all read-only; I suspect a set of PostgreSQL servers would perform much better.
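Roughly, the loop looks like this (a sketch; the path and the events/user_id names are placeholders, not my actual schema):

    # Sketch of the extract-then-analyze workflow; names are made up.
    import duckdb

    # Step 1 (on the cluster): dump the read-only tables to Parquet, e.g.
    #   spark.table("events").write.parquet("/data/events")

    # Step 2 (locally): point DuckDB at the Parquet files and iterate.
    con = duckdb.connect()
    top_users = con.execute("""
        SELECT user_id, COUNT(*) AS n_events
        FROM read_parquet('/data/events/*.parquet')
        GROUP BY user_id
        ORDER BY n_events DESC
        LIMIT 10
    """).df()
    print(top_users)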
Yes. My job involves pulling a ton of data off Redshift into Parquet files, and then working with them using DuckDB (sooo much faster; DuckDB is parallelized, vectorized, and just plain fast on Parquet datasets).
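To make that concrete, here's roughly the shape of it (a sketch; the bucket, region, and columns are placeholders, assuming the Parquet files were UNLOADed to S3):

    # Sketch of querying Redshift-unloaded Parquet straight off S3 with
    # DuckDB; bucket name, region, and columns are all placeholders.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")   # S3 support
    con.execute("SET s3_region = 'us-east-1';")
    # (credentials via SET s3_access_key_id / s3_secret_access_key, omitted)

    # DuckDB scans the files in parallel with a vectorized engine,
    # reading only the columns the query touches.
    totals = con.execute("""
        SELECT region, SUM(amount) AS total
        FROM read_parquet('s3://my-bucket/sales/*.parquet')
        GROUP BY region
    """).fetchall()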
Why Postgres? DuckDB is column-based and Postgres is row-based. For analytics workloads, I’m having a hard time thinking of a scenario where Postgres wins in terms of performance.
If your data is too big to fit into DuckDB, consider ClickHouse, which is also column-based and understands standard SQL.
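The same kind of aggregate on ClickHouse, for example via the clickhouse-connect Python client (the host and the sales table are assumptions, presuming a server is already running):

    # Sketch of an aggregate against ClickHouse via clickhouse-connect;
    # host and table names are placeholders for illustration.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host='localhost')

    # Column-oriented storage means this reads just two columns of the
    # table, however wide it is.
    rows = client.query("""
        SELECT region, sum(amount) AS total
        FROM sales
        GROUP BY region
    """).result_rows
    print(rows)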
I hear you. I was thinking something like "they don't have that much data anyway." For sure, there are better choices.