Comment by sgt101

9 years ago

The first paying job we ran through our Hadoop cluster in 2011 had 12 billion rows, and they were fairly big rows. This was beyond the limit of what our proprietary MPP database cluster could handle in the processing window it had (to be fair, the poor thing was/is loaded 90%+, which is not a great thing, but a true thing for many enterprises). We couldn't get budget for the scaling bump we'd hit with the evolution of that machine, but we could pull together a six-node Hadoop cluster, and lo and behold, for a pittance we got a little co-processor that could. One other motivation was/is that that use case accumulates 600M rows a day, and we were then able to engineer (cheaply) a solution that can hold 6 months of that data vs 20 days. After 6 months our current view is that the data isn't worth keeping, but we are beginning to get cases of regret that we've ditched the longer-window stuff.
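For a sense of scale, here's a rough back-of-envelope sketch of what those two retention windows imply. The ~600M rows/day figure comes from the comment above; the 500-byte row size is purely an assumed placeholder for "fairly big" rows, not something stated in the original.

```python
# Back-of-envelope retention arithmetic (sketch, not from the original comment).
ROWS_PER_DAY = 600_000_000   # ~600M rows/day, per the comment
BYTES_PER_ROW = 500          # assumption: "fairly big" rows, ~0.5 KB each

for label, days in [("20-day window (old MPP)", 20), ("6-month window (Hadoop)", 180)]:
    rows = ROWS_PER_DAY * days
    terabytes = rows * BYTES_PER_ROW / 1e12
    print(f"{label}: ~{rows / 1e9:.0f}B rows, roughly {terabytes:.0f} TB before replication")
```

Under those assumptions the 20-day window works out to about 12B rows (~6 TB) and the 6-month window to about 108B rows (~54 TB), which is the kind of gap that made the cheap cluster attractive.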

There are queries and treatments that process hundreds of billions of substantial database rows on other cheap open source infrastructures, and you can buy proprietary data systems that do it as well (and they are good), but if you want to do it cheaply and flexibly then so far I think Hadoop wins.

I think that Hadoop won 4 years ago and has been the centre of development ever since (in fact earlier, when MS cancelled Dryad). I think it will continue to be the weapon of choice for at least 6 more years and will be around and important for 20 more after that. My only strategic concern is the filesystem splintering that is going on with HDFS/Kudu.