Comment by sgt101

9 years ago

Ok, now do it for >2TB.

Our prod Hadoop dataset is now >130TB, try that!

> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools. If you have a huge amount of data or really need distributed processing, then tools like Hadoop may be required, but more often than not these days I see Hadoop used where a traditional relational database or other solutions would be far better in terms of performance, cost of implementation, and ongoing maintenance.

You can get 128GB DIMMs these days, so 2TB is easy to fit in memory. 130TB, yes, that's a different story.

I am a bioinformatician. 130TB of raw reads or processed data? Are you trying to build a general-purpose platform for all *-seq or focusing on something specific (genotyping)?

  • I think you might be replying to my comment. We just took delivery of a 20K WGS callset that is 30TB gzip compressed (about 240TB uncompressed) and expect something twice as big by the end of the year. We're trying to build something pretty general for variant-level data (post-calling, no reads), annotation, and phenotype data. Currently we focus on QC, rare and common variant association, and tools for studying rare disease. Everything is open source, we develop in the open, and we're trying hard to make it easy for others to develop methods on our infrastructure. Feel free to email if you'd like to know more.

Note the magic words were "can be faster", not "are faster".

If you'd read the entire article, you'd even have picked up that he's explicitly calling out the use of Hadoop for data that easily fits in memory, not large datasets.

Some back-of-the-envelope calculations show it would take about 3 days using the article's method and a 2Gbit pipe.
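
For what it's worth, the arithmetic behind that kind of estimate is simple. The throughput figures below are assumptions, not measurements, and the answer is dominated by whichever of the network pipe or the single-machine processing rate is slower:

    # Back-of-the-envelope: time to stream 130TB through a single machine.
    # Both throughput figures are assumptions, not measurements.
    TB = 10**12  # bytes

    def days_at(bytes_per_second, data_bytes=130 * TB):
        return data_bytes / bytes_per_second / 86400

    print(f"2Gbit pipe (250MB/s) as the bottleneck: {days_at(250e6):.1f} days")  # ~6 days
    print(f"~500MB/s end-to-end pipeline:           {days_at(500e6):.1f} days")  # ~3 days

So "about 3 days" implies sustaining roughly 500MB/s end to end; a 2Gbit link on its own caps it closer to 6 days.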

Out of curiosity, how long does it take you to process 130TB on Hadoop, and where/how is the data stored?

  • It's about four hours on an on-prem commodity cluster with ~PB raw storage across 22 nodes. Each node has 12 × 4TB disks (fat twins) and two Xeons with (I think) 8 cores, and 256GB RAM. It's got a dedicated 10GbE network with its own switches.

    The processing is a record addition per item (basically a new row is added to a matrix for every item we are examining, and an old row drops out) and a recalculation of aggregate statistics based on the old data and the new data (roughly sketched below); the aggregate numbers are then offloaded to a front-end database for online use. The calculations are not rocket science, so this is not compute-bound in any way.

    I think that we can do it because of data parallelism: the right data is available on each node, every core, and every disk, so each pipeline just thumps away at ~50MB/s, and there are about 300 of them, so that's lots of crunching. At the moment I can't see why we can't scale more, although I believe that the max size of the app will be no more than 2× where we are now.
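
    A sanity check on those numbers: assuming the ~50MB/s figure is per pipeline, ~300 pipelines give roughly 15GB/s of aggregate scan bandwidth, which covers 130TB in a bit over two hours of pure streaming, consistent with the reported four hours once overhead is included. Below is a hypothetical sketch of the kind of incremental aggregate update described in the comment (the actual statistics, schema, and framework are not known); the point is that sliding one row in and one row out costs work proportional to the number of columns, not the size of the matrix:

        # Hypothetical per-column aggregates maintained incrementally (Python sketch).
        # Not the commenter's actual code; a real job would run this logic inside
        # whatever framework the cluster uses.
        from dataclasses import dataclass

        @dataclass
        class ColumnStats:
            n: int = 0
            total: float = 0.0
            total_sq: float = 0.0

            def add(self, x: float) -> None:
                self.n += 1
                self.total += x
                self.total_sq += x * x

            def remove(self, x: float) -> None:
                self.n -= 1
                self.total -= x
                self.total_sq -= x * x

            @property
            def mean(self) -> float:
                return self.total / self.n if self.n else 0.0

            @property
            def variance(self) -> float:
                return self.total_sq / self.n - self.mean ** 2 if self.n else 0.0

        def roll(stats, new_row, old_row):
            """One 'record addition per item': new row in, old row out, O(columns) work."""
            for s, new_val, old_val in zip(stats, new_row, old_row):
                s.add(new_val)
                s.remove(old_val)

        # Seed two columns with three rows, then slide the window by one row.
        stats = [ColumnStats(), ColumnStats()]
        for row in ([1.0, 10.0], [2.0, 20.0], [3.0, 30.0]):
            for s, x in zip(stats, row):
                s.add(x)
        roll(stats, new_row=[4.0, 40.0], old_row=[1.0, 10.0])
        print([round(s.mean, 2) for s in stats])  # [3.0, 30.0]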