Comment by donatj
2 years ago
My work sent me to a Hadoop workshop in 2016 where in the introduction the instructor said Hadoop would replace the traditional RDBMS within five years. We went on to build a system to search the full text of Shakespeare for word instances that took a solid minute to scan maybe 100k of text. An RDBMS with decent indexes could have done that work instantly; hell, awk | grep | sort | uniq -c could have done that work instantly.
It’s been 8 years and I think RDBMS is stronger than ever?
Colored the entire course with a “yeah right”. Frankly, is Hadoop still popular? Sure, it’s still around, but I don’t hear much about it anymore. Never ended up using it professionally; I do most of my heavy data processing in Go and it works great.
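For a sense of scale, a word-instance count like the one described above fits in a few lines on a single machine. A minimal sketch in Python, assuming the text is already loaded as a string; `count_words` and the sample line are made up for illustration, not anything from the workshop:

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count case-insensitive word occurrences in a block of text."""
    # Lowercase first, then pull out runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample input; in practice this would be the full text read from a file.
sample = "To be, or not to be, that is the question"
counts = count_words(sample)
print(counts.most_common(3))
```

This is one in-memory pass over the text, the same shape of work as the awk | grep | sort | uniq -c pipeline.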
Hadoop has largely been replaced by Spark, which eliminates a lot of Hadoop's inefficiencies. HDFS is still reasonably popular, but in your use case, running locally would still be much better.
Spark is still pretty non-performant.
If the workload fits in memory on a single machine, DuckDB is so much more lightweight and faster.
My current task at my day job is analyzing a large amount of data stored in a Spark cluster. I'd say, so far, 80% of the job has been extracting data from the cluster so that I can work with it interactively with DuckDB.
This data is all read-only; I suspect a set of PostgreSQL servers would perform much better.
In terms of the actual performance? Sure. In terms of the overhead, the mental model shift, the library changes, the version churn and problems with Scala/Spark libraries, the black-box debugging? No, still really inefficient.
Most of the companies I have worked with that actively have Spark deployed are using it on queries with less than 1 TB of data at a time, and boy howdy does it make no sense.
I haven't really encountered most of the problems you mentioned, but I agree it can certainly be inefficient in terms of runtime. That said, I think if you're already using HDFS for data storage, being able to easily bolt on Spark does make for nice ease of use.