Comment by nickpsecurity

9 years ago

"You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. "

There's usually a combination of apps that work within the memory of the compute nodes, plus a huge amount of external storage with a clustered filesystem, RAID, etc. Below is an example supercomputer from SGI, since you brought them up, that illustrates how they separate compute, storage, management, and so on. Management software is available for most clusters to automate, or at least make easy, a lot of what you described in your later paragraph; they use one. This was mostly a solved problem over a decade ago, with sometimes only one or two people running entire supercomputer centers at various universities.

http://www.nas.nasa.gov/hecc/resources/pleiades.html

Yes, but old-school MPI-style supercomputer clusters are closer to Hadoop-style clusters than to standalone machines for the purposes of this discussion.

Both have mechanisms for doing distributed processing on data that is too big for a single machine.
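
A minimal sketch of what that looks like on the MPI side, assuming a hypothetical line-counting job over a file on the cluster's shared filesystem (the path and the task are made up for illustration, not taken from the article): each rank counts newlines in its own byte slice of the file, then rank 0 sums the partial counts with MPI_Reduce.

    /* Build: mpicc -O2 count.c -o count; run: mpirun -np 64 ./count */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Hypothetical dataset on the shared (e.g. Lustre) filesystem. */
        FILE *f = fopen("/scratch/dataset.txt", "rb");
        if (!f) MPI_Abort(MPI_COMM_WORLD, 1);

        /* Each rank takes one contiguous byte slice of the file. */
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        long chunk = size / nprocs;
        long begin = (long)rank * chunk;
        long end = (rank == nprocs - 1) ? size : begin + chunk;
        fseek(f, begin, SEEK_SET);

        /* Count newlines in this slice; each '\n' byte falls in
           exactly one slice, so the split is exact. */
        long local = 0;
        for (long pos = begin; pos < end; pos++)
            if (fgetc(f) == '\n') local++;
        fclose(f);

        /* Rank 0 receives the sum of all partial counts. */
        long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total lines: %ld\n", total);

        MPI_Finalize();
        return 0;
    }

The toy job isn't the point; the point is that the coordination and data-partitioning machinery (ranks, a reduce step, shared storage) is the same kind of thing Hadoop gives you, just older.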

The original argument was that command-line tools etc. are sufficient. In both of these cases, they aren't.