Comment by michaelmior

2 years ago

Hadoop has largely been replaced by Spark, which eliminates a lot of Hadoop's inefficiencies. HDFS is still reasonably popular, but for your use case, running locally would still be much better.

Spark is still pretty non-performant.

If the workload fits in memory on a single machine, DuckDB is so much more lightweight and faster.
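As a rough illustration of how little setup that local path needs, here is a minimal sketch using DuckDB's Python API; the file name `events.csv` and its columns are hypothetical stand-ins for whatever local data you have.

```python
# Minimal local analytics with DuckDB: no cluster, no services to run.
# pip install duckdb
import duckdb

# Query a local CSV file directly by path ("events.csv" is a
# hypothetical example); DuckDB infers the schema automatically.
top_users = duckdb.sql("""
    SELECT user_id, COUNT(*) AS n_events
    FROM 'events.csv'
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").fetchall()
print(top_users)
```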

  • My current task at my day job is analyzing a large amount of data stored in a Spark cluster. I'd say, so far, 80% of the job has been extracting data from the cluster so that I can work with it interactively with DuckDB.

    This data is all read-only; I suspect a set of PostgreSQL servers would perform much better.

    • Yes. My job involves pulling a ton of data off Redshift into Parquet files and then working with them using DuckDB (sooo much faster: DuckDB is parallelized, vectorized, and just plain fast on Parquet datasets; a sketch of that pattern follows after this thread).

    • Why Postgres? DuckDB is column-based and Postgres is row-based. For analytics workloads, I’m having a hard time thinking of a scenario where Postgres wins in terms of performance.

      If your data is too big to fit into DuckDB, consider ClickHouse, which is also column-based and understands standard SQL (see the second sketch after this thread).

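Here is a minimal sketch of the Parquet-plus-DuckDB pattern mentioned above, assuming the data has already been exported; the `exports/*.parquet` glob and the column names are hypothetical.

```python
# Interactive analysis of exported Parquet files with DuckDB.
# DuckDB scans Parquet with a parallel, vectorized engine, which is
# why this pattern tends to be fast.
# pip install duckdb pandas
import duckdb

con = duckdb.connect()  # in-memory database, nothing to administer

# Glob over the exported files; path and columns are hypothetical.
daily_counts = con.sql("""
    SELECT event_date, COUNT(*) AS n_rows
    FROM read_parquet('exports/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").df()  # materialize as a pandas DataFrame for interactive work
print(daily_counts.head())
```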
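And a sketch of the ClickHouse alternative suggested above, using the clickhouse-connect Python driver; the connection details and the `events` table are hypothetical and assume a ClickHouse server is already running.

```python
# Standard SQL against ClickHouse, for datasets that outgrow one machine.
# pip install clickhouse-connect
import clickhouse_connect

# Hypothetical connection details; assumes a reachable ClickHouse server.
client = clickhouse_connect.get_client(host='localhost', port=8123)

result = client.query("""
    SELECT event_date, count() AS n_rows
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
print(result.result_rows[:5])
```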

In terms of the raw performance? Sure. In terms of the overhead, the mental-model shift, the library changes, the version churn and compatibility problems with Scala/Spark libraries, and the black-box debugging? No, it's still really inefficient.

Most of the companies I have worked with that actively have Spark deployed are using it on queries over less than 1 TB of data at a time, and boy howdy does it make no sense.

  • I haven't really encountered most of the problems you mentioned, but I agree it can certainly be inefficient in terms of runtime. That said, if you're already using HDFS for data storage, being able to bolt Spark on easily is a real convenience.