Comment by amelius
9 years ago
But what would happen if those exact same command-line tools were used inside a Hadoop node? What would be the optimum number of processors then?
Reply
9 years ago
> But what would happen if those exact same command-line tools were used inside a Hadoop node? What would be the optimum number of processors then?
That depends on the tradeoff between management/transfer overhead and actually doing work.
Always in the "word count" style examples, and quite often in real life, getting the data into the process takes more time than actually processing it.
When you need to distribute, you need to distribute. However, the point where you actually need to distribute is at roughly 100x more data than most Hadoop users have, and the overhead costs are far from negligible - in fact, they dominate everything until you reach that scale. Below it, you would just be adding management overhead.
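A rough way to see where that break-even sits (a minimal sketch; every rate and overhead constant below is an illustrative assumption, not a measurement):

```python
# Break-even model: local pipeline vs. distributed job.
# Every constant here is an illustrative assumption, not a benchmark.

LOCAL_RATE = 270e6      # bytes/s one tuned shell pipeline streams on one box
NODE_RATE = 20e6        # effective bytes/s per node after framework overhead
FIXED_OVERHEAD = 60.0   # seconds of scheduling/startup before any real work

def break_even_bytes(nodes):
    """Data size above which the cluster wins; None if it never does."""
    aggregate = nodes * NODE_RATE
    if aggregate <= LOCAL_RATE:
        return None  # cluster can't even match one machine's throughput
    return FIXED_OVERHEAD / (1 / LOCAL_RATE - 1 / aggregate)

for n in (4, 16, 64, 256):
    b = break_even_bytes(n)
    label = "never" if b is None else f"above {b / 1e9:,.0f} GB"
    print(f"{n:>3} nodes: cluster wins {label}")
```

With these made-up numbers, a small cluster never beats one tuned pipeline at all, and even a large one only pays off past tens of gigabytes; the real break-even depends entirely on your hardware, network, and framework.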
More software != more efficient software.
But it can be faster, because it runs in parallel.
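That parallelism doesn't require a cluster, though; the same job can use every local core. A minimal sketch of a multi-core word count with Python's multiprocessing (the file glob and worker count are placeholder assumptions):

```python
# Parallel word count across local cores -- no cluster required.
# The input glob and process count are placeholders for the sketch.
import glob
from collections import Counter
from multiprocessing import Pool

def count_words(path):
    """Count words in one file; each worker streams its own file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    files = glob.glob("data/*.txt")  # placeholder input set
    with Pool(processes=8) as pool:  # roughly one worker per core
        total = Counter()
        for partial in pool.imap_unordered(count_words, files):
            total.update(partial)
    print(total.most_common(10))
```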