Comment by sgt101

9 years ago

It's about four hours on an on-prem commodity cluster with ~1PB of raw storage across 22 nodes. Each node has 12 × 4TB disks (fat twins) and two Xeons with (I think) 8 cores each, plus 256GB of RAM. It's got a dedicated 10GbE network with its own switches.
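A quick sanity check on those figures (my arithmetic, using only the node and disk counts quoted above):

    # Back-of-envelope check of the cluster specs quoted above.
    nodes = 22
    disks_per_node = 12
    disk_tb = 4

    raw_tb = nodes * disks_per_node * disk_tb   # 1056 TB, i.e. roughly 1 PB raw
    cores = nodes * 2 * 8                       # 352 physical cores across the cluster
    print(f"raw storage ~{raw_tb} TB, physical cores {cores}")

That ~350-core figure also lines up roughly with the ~300 pipelines mentioned further down.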

The processing is a record addition per item (basically there's a new row in a matrix for every item we're examining, and an old row drops out) plus a recalculation of aggregate statistics based on the old and new data - the aggregate numbers are then offloaded to a front-end database for online use. The calculations are not rocket science, so this is not compute-bound in any way.
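For what it's worth, here's a toy sketch of that kind of incremental update - retire an item's old row, add its new one, and adjust the running aggregate without rescanning everything. All the names and the structure are mine, not the real pipeline:

    # Toy sketch: maintain an aggregate (count/sum/mean) as rows are replaced.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Aggregate:
        count: int = 0
        total: float = 0.0

        @property
        def mean(self) -> float:
            return self.total / self.count if self.count else 0.0

        def replace_row(self, old_value: Optional[float], new_value: float) -> None:
            # Retire the old row (if the item existed before) and add the new one,
            # so the aggregate never needs a full rescan of the matrix.
            if old_value is not None:
                self.total -= old_value
                self.count -= 1
            self.total += new_value
            self.count += 1

    agg = Aggregate()
    agg.replace_row(None, 10.0)    # brand new item: adds a row
    agg.replace_row(10.0, 12.5)    # re-examined item: old row disappears, new row added
    print(agg.count, agg.mean)     # 1 12.5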

I think we can do it because of data parallelism: the right data is local to each node, every core, and every disk, so each pipeline just thumps away at ~50MB/s, and there are about 300 of them, so that's a lot of crunching. At the moment I can't see why we couldn't scale further, although I believe the app will never need to grow beyond 2× its current size.
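Rough throughput arithmetic for those numbers, assuming ~50MB/s means megabytes per second per pipeline (my reading of the unit):

    # Aggregate throughput if ~300 pipelines each stream at ~50 MB/s.
    pipelines = 300
    mb_per_s = 50                                       # assuming megabytes, not megabits
    hours = 4

    aggregate_gb_s = pipelines * mb_per_s / 1000        # ~15 GB/s across the cluster
    scanned_tb = aggregate_gb_s * 3600 * hours / 1000   # ~216 TB touched in a 4-hour run
    print(f"~{aggregate_gb_s:.0f} GB/s aggregate, ~{scanned_tb:.0f} TB per run")

So a four-hour run at that rate touches a decent fraction of the ~1PB of raw storage.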