Comment by amelius
9 years ago
But what would happen if those exact same command-line tools were used inside a Hadoop node? What would be the optimum number of processors then?
Reply
9 years ago
> But what would happen if those exact same command-line tools were used inside a Hadoop node? What would be the optimum number of processors then?
That depends on the tradeoff between management/transfer overhead and actually doing work.
Always in the "word count" style examples, and quite often in real life, getting the data into the process takes more time than actually processing it.
When you need to distribute, you need to distribute. However, the point where you actually need to distribute is at roughly 100x more data than most Hadoop users have, and the overhead costs are far from negligible - in fact, they dominate everything until you reach that scale. Below it, you would just be adding management overhead.
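A rough way to see where that break-even sits (a minimal sketch; every rate and overhead constant below is an illustrative assumption, not a measurement):

```python
# Break-even model: local pipeline vs. distributed job.
# Every constant here is an illustrative assumption, not a benchmark.

LOCAL_RATE = 270e6      # bytes/s one tuned shell pipeline streams on one box
NODE_RATE = 20e6        # effective bytes/s per node after framework overhead
FIXED_OVERHEAD = 60.0   # seconds of scheduling/startup before any real work

def break_even_bytes(nodes):
    """Data size above which the cluster wins; None if it never does."""
    aggregate = nodes * NODE_RATE
    if aggregate <= LOCAL_RATE:
        return None  # cluster can't even match one machine's throughput
    return FIXED_OVERHEAD / (1 / LOCAL_RATE - 1 / aggregate)

for n in (4, 16, 64, 256):
    b = break_even_bytes(n)
    label = "never" if b is None else f"above {b / 1e9:,.0f} GB"
    print(f"{n:>3} nodes: cluster wins {label}")
```

With these made-up numbers, a small cluster never beats one tuned pipeline at all, and even a large one only pays off past tens of gigabytes; the real break-even depends entirely on your hardware, network, and framework.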
More software != more efficient software.
But it can be faster, because it runs in parallel.
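That parallelism doesn't require a cluster, though; the same job can use every local core. A minimal sketch of a multi-core word count with Python's multiprocessing (the file glob and worker count are placeholder assumptions):

```python
# Parallel word count across local cores -- no cluster required.
# The input glob and process count are placeholders for the sketch.
import glob
from collections import Counter
from multiprocessing import Pool

def count_words(path):
    """Count words in one file; each worker streams its own file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    files = glob.glob("data/*.txt")  # placeholder input set
    with Pool(processes=8) as pool:  # roughly one worker per core
        total = Counter()
        for partial in pool.imap_unordered(count_words, files):
            total.update(partial)
    print(total.most_common(10))
```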