Comment by aub3bhat

9 years ago

This article is a great litmus test for checking whether someone has experience working at scale (multi-terabyte data, multiple analysts, multiple job types) or not. Anyone who has had that experience will instantly explain why this article is wrong. It's akin to saying a Tesla is faster than a Boeing 777 on a 100-meter track.

I'd hope people who have worked at scale are still capable of recognizing when the tools they used there are total overkill. I'd suspect they would be, since they'd also be more aware of those tools' limitations (vs. somebody without that experience, who has little choice but to believe the "you need big data and everything is easy" marketing).

That you wouldn't use a Boeing 777 IF your problem is just a 100 m track is the entire point of the article. It's explicitly not saying that you should never use the big tools.

  • They are not overkill at all; rather, they are tuned toward a different set of performance characteristics. In the Boeing 777 example above, that's the transatlantic journey.

    In the article above, the data and results stay on the local disk. In any organization, however, they need to be stored in a distributed manner, available to multiple users with varying levels of technical expertise: typically on NFS or HDFS, and preferably as records stored and indexed via Hive/Presto. At that point the real issue is how to reduce the delay from transferring data over the network, which is exactly the original idea behind Hadoop/MapReduce: moving computation closer to the data (see the sketch below).
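
    As a minimal sketch of that idea, assuming a PySpark cluster (the HDFS path and column names below are hypothetical), the filter and aggregation run on the executors that hold the HDFS blocks, and only the small aggregated result crosses the network back to the driver:

    ```python
    # Sketch of "moving computation closer to the data" with PySpark.
    # The HDFS path and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

    # Read records already stored in HDFS; no bulk copy to the local machine.
    events = spark.read.parquet("hdfs:///warehouse/events")

    # The scan, filter, and aggregation run where the data lives:
    # each executor processes the blocks stored on (or near) its own node.
    daily_errors = (
        events
        .filter(F.col("status") == "error")
        .groupBy("event_date")
        .agg(F.count("*").alias("errors"))
    )

    # Only the tiny aggregate travels back over the network to the driver.
    daily_errors.show()
    ```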

    • rolls eyes

      The point is that if you've got such a tiny quantity of data, why are you storing it in a distributed manner, and why are you breaking out the 777 for a trip around the racetrack? Grab the 777 when you need it, and take the Tesla when you need the performance characteristics of a Tesla.