Comment by wenc

2 years ago

Spark is still pretty non-performant.

If the workload fits in memory on a single machine, DuckDB is so much more lightweight and faster.

My current task at my day job is analyzing a large amount of data stored in a Spark cluster. I'd say, so far, 80% of the job has been extracting data from the cluster so that I can work with it interactively in DuckDB.
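
Roughly what that extract-then-query loop looks like (table name and paths are made up, just to show the shape of it):

    from pyspark.sql import SparkSession
    import duckdb

    spark = SparkSession.builder.getOrCreate()

    # One slow pass to get the data out of the cluster as Parquet...
    spark.table("warehouse.events").write.mode("overwrite").parquet("/tmp/events")

    # ...then everything after that is fast local iteration in DuckDB.
    duckdb.sql("SELECT count(*) FROM '/tmp/events/*.parquet'").show()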

This data is all read-only; I suspect a set of PostgreSQL servers would perform much better.

  • Yes. My job involves pulling a ton of data off Redshift into Parquet files, and then working with them using DuckDB (sooo much faster — DuckDB is parallelized, vectorized and just plain fast on Parquet datasets)
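
    A minimal sketch of the DuckDB side, assuming the Redshift data has already been unloaded as Parquet (paths and column names are hypothetical):

        import duckdb

        con = duckdb.connect()  # in-process, nothing to stand up
        con.sql("""
            SELECT region, sum(revenue) AS total
            FROM read_parquet('/exports/orders/*.parquet')
            GROUP BY region
            ORDER BY total DESC
        """).show()

    DuckDB scans every file in the glob in parallel and only reads the columns the query touches.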

  • Why Postgres? DuckDB is column-based and Postgres is row-based. For analytics workloads, I’m having a hard time thinking of a scenario where Postgres wins in terms of performance.

    If your data is too big to fit into DuckDB, consider ClickHouse, which is also column-based and understands standard SQL.
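
    And if you do outgrow DuckDB, the query barely changes. A hypothetical ClickHouse version using the clickhouse-driver package (host and table names made up):

        from clickhouse_driver import Client

        client = Client("clickhouse-host")
        rows = client.execute(
            "SELECT region, sum(revenue) FROM orders GROUP BY region"
        )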

    • I hear you. I was thinking something like "they don't have that much data anyway." For sure, there are better choices.