Comment by geocar
9 years ago
> you cannot expect all of them to SSH into a single machine
Why not?
I can do exactly this (KDB warning). Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.
> The author of the blogpost/article completely … [Here's my opinion about what Hadoop really is]
This is a very real data volume with two realistic solutions, and thinking that every problem is a nail because you're so invested in your hammer is one of the things that causes people to wait 26 minutes instead of expecting the problem to take 12 seconds.
And it gets worse: terabyte-RAM machines are accessible to the kinds of companies that have petabytes of data and the business case for it, and managing the infrastructure and network bandwidth is a bigger time sink than you think (or are willing to admit).
You might think I'm splitting hairs, or that I see some value in Hadoop, so let me be clear: I think Hadoop is a cancerous tumour that has led many smart people to do very, very stupid things. It's slow, it's expensive, it's difficult to use, and investing in tooling for it is just throwing good money after bad.
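To make the 26-minutes-versus-12-seconds comparison concrete, here's a minimal Python sketch of the single-machine, single-pass approach. The data/*.pgn glob and the [Result ...] field are hypothetical stand-ins for whatever flat-file scan the article actually does; the point is just the shape of the job, not the exact code.

```python
import glob
from collections import Counter

def count_results(pattern="data/*.pgn"):
    """Stream every matching file once and tally result lines.

    Hypothetical stand-in for the kind of job people reach for Hadoop
    to do: one pass over flat files, CPU- and IO-bound, no shuffle.
    """
    totals = Counter()
    for path in glob.glob(pattern):
        with open(path, errors="ignore") as f:
            for line in f:
                if line.startswith("[Result "):
                    totals[line.strip()] += 1
    return totals

if __name__ == "__main__":
    print(count_results())
```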
> Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.
You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. You now need a bunch of experts, because any machine failure is both a critical data-loss situation and a throughput cliff. You've taken a fairly low-stakes event (a disk failure, say) and transformed it into the Mother of All Failures.
And then you're betting that next year your working set won't be over the max for the particular machine type you've settled on. What will you even do when it is? Sell this one and get the next, bigger machine? Hook them both up and run things in a distributed fashion?
And then there's the logistics of it all. You're going to use tooling to submit jobs on this machine, and there have to be configurable process limits, job queues, easy scheduling and rescheduling, etc. etc.
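To be concrete about what that tooling can look like in its most minimal form, here's a sketch of a job queue with a configurable process limit, standard library only (the crunch job and the worker counts are made up for illustration):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def crunch(chunk_id):
    """Hypothetical job submitted by a user of the shared machine."""
    return f"processed chunk {chunk_id}"

def run_queue(jobs, max_workers=32):
    """Run (function, argument) jobs with a cap on concurrent processes.

    max_workers is the 'configurable process limit'; the executor's
    internal queue is the job queue. Rescheduling a failed job would
    just mean resubmitting it.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, arg) for fn, arg in jobs]
        return [fut.result() for fut in as_completed(futures)]

if __name__ == "__main__":
    print(run_queue([(crunch, i) for i in range(100)], max_workers=8))
```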
I mean, I'm firmly in the camp that lots of problems are better solved on the giant 64 TB, 256 core machines, but you're selling an idea that has a lot of drawbacks.
And people with 64TB, 256 core machines don't have RAID arrays attached to their machine for this exact reason?
If it's "machines" plural, than you can do replication between the two. There's your fallover in case of complete failure.
> If it's "machines" plural, then you can do replication between the two.
This is the start of a scaling path that winds down Distributed Systems Avenue, and eventually leads to a place called Hadoop.
(Replication and consensus are remarkably difficult problems that Hadoop solves).
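To put a number on "remarkably difficult": once you want automatic failover without split-brain, you're typically into majority quorums, and the arithmetic is unkind to a two-machine setup. A rough sketch (the node counts are illustrative):

```python
def quorum(n_replicas):
    """Smallest majority for n replicas."""
    return n_replicas // 2 + 1

def tolerated_failures(n_replicas):
    """Replicas you can lose and still reach a majority."""
    return n_replicas - quorum(n_replicas)

for n in (2, 3, 5):
    print(f"{n} replicas -> quorum {quorum(n)} -> tolerates {tolerated_failures(n)} failure(s)")
```

Two replicas give a quorum of two and tolerate zero failures, which is why "just two machines" tends to sprout a third node and a coordination service fairly quickly.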
Both disk and CPU failures are recoverable on expensive hardware.
"You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. "
There's usually a combination of apps that work within the memory of the systems plus a huge amount of external storage with a clustered filesystem, RAID, etc. Since you brought up SGI, the example supercomputer below illustrates how they separate compute, storage, management and so on. Management software is available for most clusters to automate or simplify a lot of what you described in your later paragraph, and they use one. This was mostly a solved problem over a decade ago, with sometimes just one or two people running supercomputer centers at various universities.
http://www.nas.nasa.gov/hecc/resources/pleiades.html
Yes, but for the purpose of this discussion, old-school MPI-style supercomputer clusters are closer to Hadoop-style clusters than to standalone machines.
Both have mechanisms for doing distributed processing on data that is too big for a single machine.
The original argument was that command line tools etc are sufficient. In both these cases they aren't.