Comment by michaelmior

2 years ago

Hadoop has largely been replaced by Spark, which eliminates a lot of Hadoop's inefficiencies. HDFS is still reasonably popular, but for your use case, running locally would still be much better.

Spark is still pretty non-performant.

If the workload fits in memory on a single machine, DuckDB is so much more lightweight and faster.
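As a rough illustration of how little setup that local path needs, here is a minimal sketch using DuckDB's Python API; the file name `events.csv` and its columns are hypothetical stand-ins for whatever local data you have.

```python
# Minimal local analytics with DuckDB: no cluster, no services to run.
# pip install duckdb
import duckdb

# Query a local CSV file directly by path ("events.csv" is a
# hypothetical example); DuckDB infers the schema automatically.
top_users = duckdb.sql("""
    SELECT user_id, COUNT(*) AS n_events
    FROM 'events.csv'
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").fetchall()
print(top_users)
```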

  • My current task at my day job is analyzing a large amount of data stored in a Spark cluster. I'd say, so far, 80% of the job has been extracting data from the cluster so that I can work with it interactively with DuckDB.

    This data is all read-only; I suspect a set of PostgreSQL servers would perform much better.

    • Yes. My job involves pulling a ton of data off Redshift into Parquet files and then working with them using DuckDB (sooo much faster: DuckDB is parallelized, vectorized, and just plain fast on Parquet datasets; a sketch of that pattern follows after this thread).

    • Why Postgres? DuckDB is column-based and Postgres is row-based. For analytics workloads, I’m having a hard time thinking of a scenario where Postgres wins in terms of performance.

      If your data is too big to fit into DuckDB, consider ClickHouse, which is also column-based and understands standard SQL (see the second sketch after this thread).

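Here is a minimal sketch of the Parquet-plus-DuckDB pattern mentioned above, assuming the data has already been exported; the `exports/*.parquet` glob and the column names are hypothetical.

```python
# Interactive analysis of exported Parquet files with DuckDB.
# DuckDB scans Parquet with a parallel, vectorized engine, which is
# why this pattern tends to be fast.
# pip install duckdb pandas
import duckdb

con = duckdb.connect()  # in-memory database, nothing to administer

# Glob over the exported files; path and columns are hypothetical.
daily_counts = con.sql("""
    SELECT event_date, COUNT(*) AS n_rows
    FROM read_parquet('exports/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").df()  # materialize as a pandas DataFrame for interactive work
print(daily_counts.head())
```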
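And a sketch of the ClickHouse alternative suggested above, using the clickhouse-connect Python driver; the connection details and the `events` table are hypothetical and assume a ClickHouse server is already running.

```python
# Standard SQL against ClickHouse, for datasets that outgrow one machine.
# pip install clickhouse-connect
import clickhouse_connect

# Hypothetical connection details; assumes a reachable ClickHouse server.
client = clickhouse_connect.get_client(host='localhost', port=8123)

result = client.query("""
    SELECT event_date, count() AS n_rows
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
print(result.result_rows[:5])
```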

In terms of the raw performance? Sure. In terms of the overhead, the mental-model shift, the library changes, the version churn and compatibility problems with Scala/Spark libraries, and the black-box debugging? No, it's still really inefficient.

Most of the companies I have worked with that actively have Spark deployed are using it on queries over less than 1 TB of data at a time, and boy howdy does it make no sense.

  • I haven't really encountered most of the problems you mentioned, but I agree it can certainly be inefficient in terms of runtime. That said, if you're already using HDFS for data storage, being able to bolt Spark on easily is a real convenience.