Comment by hobs

2 years ago

In terms of the actual performance? Sure. In terms of the overhead, the mental model shift, the library changes, the version churn and problems with Scala/Spark libraries, the black-box debugging? No, it's still really inefficient.

Most of the companies I have worked with that have Spark actively deployed are using it on queries over less than 1 TB of data at a time, and boy howdy does it make no sense.

I haven't really encountered most of the problems you mention, but I agree it can certainly be inefficient in terms of runtime. That said, if you're already using HDFS for data storage, being able to bolt Spark on top does make for nice ease of use.
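
To illustrate that "bolt on" point, here's a minimal sketch of a Spark job reading data that already lives in HDFS; the namenode address, path, and column name are placeholders I've made up, not anything from this thread:

```scala
// Minimal sketch: querying existing HDFS data with Spark.
// "namenode:8020", the path, and "event_type" are hypothetical placeholders.
import org.apache.spark.sql.SparkSession

object HdfsBoltOnExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hdfs-bolt-on-example")
      .getOrCreate()

    // Spark resolves hdfs:// URIs through its Hadoop configuration,
    // so pointing at an existing HDFS path is usually all that's needed.
    val df = spark.read.parquet("hdfs://namenode:8020/data/events")

    // A simple aggregation over the existing data.
    df.groupBy("event_type").count().show()

    spark.stop()
  }
}
```

That convenience is basically the whole appeal: no data movement, just a SparkSession pointed at storage you already run.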