Comment by cs702

9 years ago

Most "big data" problems are really "small data" by the standards of modern hardware:

* Desktop PCs with up to 6TB of RAM and many dozens of cores have been available for over a year.[1]

* Hard drives with 100TB capacity in a 3.5-inch form factor were recently announced.[2]

CORRECTION: THE FIGURE IS 60TB, NOT 100TB. See MagnumOpus's comment below. In haste, I searched Google and mistakenly linked to an April Fools' story. Now I feel like a fool, of course. Still, the point is valid.

* Four Nvidia Titan X GPUs can give you up to 44 Teraflops of 32-bit FP computing power in a single desktop.[3]

Despite this, the number of people who have unnecessarily spent money and/or complicated their lives with tools like Hadoop is pretty large, particularly in "enterprise" environments. A lot of "big data" problems can be handled by a single souped-up machine that fits under your desk.
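To put that in concrete terms, here is a minimal back-of-the-envelope sketch in Python. The workload is hypothetical (a billion rows at roughly 200 bytes each, both figures illustrative and not from the comment above); only the 6TB RAM figure comes from [1]:

    # Back-of-the-envelope: does a "big data" table fit in RAM on one box?
    # Hypothetical workload: 1 billion rows at ~200 bytes per row (illustrative).
    rows = 1_000_000_000
    bytes_per_row = 200                     # assumed average row width
    dataset_gb = rows * bytes_per_row / 1024**3
    ram_gb = 6 * 1024                       # the 6TB desktop mentioned in [1]
    print(f"dataset ~{dataset_gb:.0f} GB; fits in RAM ~{ram_gb / dataset_gb:.0f}x over")
    # dataset ~186 GB; fits in RAM ~33x over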

[1] https://news.ycombinator.com/item?id=12141334

Well, basically, as soon as big data hit the news, everyone stopped doing data and started doing big data.

Same thing with data science. Opened a Jupyter notebook, loaded 100 rows, and displayed a graph - "I am a data scientist".

> Hard drives with 100TB capacity in a 3.5-inch form factor were recently announced

That is an April Fools story.

(Of course you can still get a Synology DS2411+/DX1211 24-bay NAS combo for a few thousand bucks, but it will take up a lot of space under your desk and keep your legs toasty...)

Right, "data scientists" with experience call themselves statisticians or analysts or whatever. The "data science" or "big data" industry is comprised of people who just think a million rows of data sounds impressively big because they never experienced data warehouses in the 1990s where a million rows was not even anything special...

  • The first paying job we ran through our Hadoop cluster in 2011 had 12 billion rows, and they were fairly big rows. This was beyond the limit of what our proprietary MPP database cluster could handle in the processing window it had (to be fair, the poor thing was/is loaded 90%+, which is not a great thing, but a true thing for many enterprises). We couldn't get budget for the scaling bump we hit with the evolution of that machine, but we could pull together a six-node Hadoop cluster and, lo and behold, for a pittance we got a little co-processor that could. Another motivation was/is that the use case accumulates 600m rows a day, and we were then able to engineer (cheaply) a solution that can hold six months of that data versus 20 days (rough arithmetic sketched after this comment). After six months our current view is that the data isn't worth keeping, but we are beginning to get cases of regret that we've ditched the longer-window stuff.

    There are queries and treatments that process hundreds of billions of substantial database rows on other cheap open-source infrastructures, and you can buy proprietary data systems that do it as well (and they are good), but if you want to do it cheaply and flexibly then, so far, I think Hadoop wins.

    I think Hadoop won 4 years ago and has been the centre of development ever since (in fact from before that, when MS cancelled Dryad). I think it will continue to be the weapon of choice for at least 6 more years and will be around and important for 20 more after that. My only strategic concern is the filesystem splintering that is going on with HDFS/Kudu.
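    A rough back-of-the-envelope with the figures quoted above (600m rows/day, a 20-day window versus roughly six months, approximated here as 180 days):

        # Retention arithmetic for the workload described above:
        # 600 million rows/day; compare a 20-day window with a ~6-month (180-day) one.
        rows_per_day = 600_000_000
        print(f"20-day window:  {rows_per_day * 20 / 1e9:.0f} billion rows")
        print(f"6-month window: {rows_per_day * 180 / 1e9:.0f} billion rows")
        # 20-day window:  12 billion rows
        # 6-month window: 108 billion rows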

So you have large data storage, and processing that can handle large data (assuming for convenience that you have a conventional x86 processor with that throughput). The only problem that remains is moving things from the former to the latter, and then back again once you're done calculating.

That's (100 * 1024 GB) / (20 GB/s) ≈ 85 minutes just to move your 100 TB to the processor, assuming your storage can operate at the same speed as DDR4 RAM. A 100-node Hadoop cluster with commodity disks (~0.2 GB/s each) gets the same aggregate throughput: 100 * 0.2 GB/s = 20 GB/s, so (100 * 1024 GB) / (20 GB/s) ≈ 85 minutes there too.

Back-of-the-envelope stuff, obviously, with caveats everywhere.
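A minimal sketch of that arithmetic, using the assumptions above (~20 GB/s of DDR4-class bandwidth for the single box, ~0.2 GB/s per commodity disk across 100 nodes):

    # Back-of-the-envelope data-movement times for 100 TB.
    data_gb = 100 * 1024                    # 100 TB expressed in GB
    single_box_gbps = 20                    # assumed DDR4-class throughput, GB/s
    disk_gbps, nodes = 0.2, 100             # commodity disk per node, 100-node cluster
    single_minutes = data_gb / single_box_gbps / 60
    cluster_minutes = data_gb / (disk_gbps * nodes) / 60
    print(f"single box: {single_minutes:.0f} min; 100-node cluster: {cluster_minutes:.0f} min")
    # single box: 85 min; 100-node cluster: 85 min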

The problem with that kind of setup is that if you unexpectedly need to scale beyond it, you haven't done any of the work required to do so, and you're stuck.

  • How often do you "unexpectedly need to scale out"? By an order of magnitude at least, that is; anything less and you could just add a few more of those beefed-up machines.

    I wonder what happened to the YAGNI principle. Its applicability is arguable in some places, but here it seems to fit perfectly.

    • I've had such a situation, and we were lucky that we had written software that makes dealing with embarrassingly parallel problems embarrassingly scalable.
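      In case a concrete picture helps, a minimal sketch of that "embarrassingly parallel" shape (process_chunk is a hypothetical stand-in for the real per-chunk work; standard-library multiprocessing only):

        # Embarrassingly parallel: independent chunks, no coordination between workers.
        from multiprocessing import Pool

        def process_chunk(chunk):
            return sum(chunk)               # placeholder for the real per-chunk work

        if __name__ == "__main__":
            chunks = [list(range(i, i + 1000)) for i in range(0, 100_000, 1000)]
            with Pool() as pool:            # scales with local cores...
                results = pool.map(process_chunk, chunks)
            # ...and because chunks are independent, the same map can be farmed out
            # to more machines without touching the per-chunk code.
            print(sum(results))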

Yes, there are desktops with huge amounts of RAM, but buying a machine like that would probably cost more than setting up a Hadoop cluster on commodity hardware. And for embarrassingly parallel problems, Hadoop can scale semi-seamlessly.

In reality, it still takes work... but can be done.