
Comment by donatj

2 years ago

My work sent me to a Hadoop workshop in 2016 where, in the introduction, the instructor said Hadoop would replace the traditional RDBMS within five years. We went on to build a system to search the full text of Shakespeare for word occurrences, and it took a solid minute to scan maybe 100k of text. An RDBMS with decent indexes could have done that work instantly; hell, `awk | grep | sort | uniq -c` could have done it instantly.
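For reference, the kind of coreutils pipeline the comment alludes to really is a one-liner. This is a sketch, not the workshop's actual code; the filename `shakespeare.txt` is a placeholder for any plain-text corpus:

```shell
# Word-frequency count over a plain-text file (hypothetical shakespeare.txt).
# tr -cs turns every run of non-letters into a single newline (one word per line),
# the second tr lowercases, then sort | uniq -c | sort -rn ranks words by count.
tr -cs 'A-Za-z' '\n' < shakespeare.txt \
  | tr 'A-Z' 'a-z' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -10
```

On a ~100k file this finishes in well under a second on any modern machine, which is the point being made.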

It’s been 8 years and I think RDBMS is stronger than ever?

Colored the entire course with a "yeah right". Frankly, is Hadoop still popular? Sure, it's still around, but I don't hear much about it anymore. I never ended up using it professionally; I do most of my heavy data processing in Go and it works great.

https://twitter.com/donatj/status/740210538320273408

Hadoop has largely been replaced by Spark, which eliminates a lot of Hadoop's inefficiencies. HDFS is still reasonably popular, but for your use case, running locally would still be much better.

  • Spark is still pretty non-performant.

    If the workload fits in memory on a single machine, DuckDB is so much more lightweight and faster.

    • My current task at my day job is analyzing a large amount of data stored in a Spark cluster. I'd say, so far, 80% of the job has been extracting data from the cluster so that I can work with it interactively with DuckDB.

      This data is all read-only; I suspect a set of PostgreSQL servers would perform much better.


  • In terms of the actual performance? Sure. But in terms of the overhead, the mental-model shift, the library changes, the version churn and problems with Scala/Spark libraries, the black-box debugging? No, still really inefficient.

    Most of the companies I have worked with that actively have Spark deployed are running it on queries over less than 1 TB of data at a time, and boy howdy does that make no sense.

    • I haven't really encountered most of the problems you mention, but I agree it can certainly be inefficient in terms of runtime. That said, if you're already using HDFS for data storage, being able to easily bolt on Spark does make for nice ease of use.