← Back to context

Comment by qwertyuiop924

9 years ago

But, as I pointed out in another comment, what about systems like Manta, which make transitioning from this sort of script to a full-on mapreduce cluster trivial?

Mind, I don't know the performance metrics for Manta vs Hadoop, but it's something to consider...

Totally agree. It'd be relatively trivial to automate converting this script into a distributed application. Haven't checked Manta out, but I will. For ultimate performance though right now you could go for something like OpenMP + MPI which gets you scalability/fault-tolerance. In a few months you'll also be able to use something like RaftLib as a dataflow/stream processing API for distributed computation (almost ready to roll out the distributed back-end). MPI though has decades of research in HPC to make it the most robust distributed compute platform in existence (though not the most easy to use). You think your big data problems are big...nah, supercomputers were doing todays big data back in the late 90's. Just a totally different crowd with slightly different solutions. MPI is hard to use, Spark/Storm is much easier...but much slower.

From my experience organizations have adopted, Hive/Presto/Spark on top of Hadoop. Which actually solves a whole bunch of problems that "script" approach would not. With several added benefits. Executing scripts (cat, grep, uniq, sort) do not provide similar, benefits, while they might be faster. A dedicated solution such as Presto by Facebook will provide similar if not even faster results.

https://prestodb.io/

  • Ah, so it doesn't solve data storage, and runs SQL queries, which are less capable than UNIX commmands. If your data's stuck inside 15 SQL DBs, than that'd make sense, but a lot of data is just stored in flat files. And you know what's really good at analyzing flat files? Unix commands.

    • Did you even read it? Presto reads directly from HDFS, which is as close to distributed "flat files" as you can get. As far as "SQL being less capable than UNIX commands", you have got to be kidding me. SQL allows type checking, conversion, joins all of which are difficult if not impossible with grep | uniq | sort etc.

      3 replies →