Comment by mej10

9 years ago

> 1.75GB

> 3.46GB

These will fit in memory on modest hardware. No reason to use Hadoop.

The title could be: "Using tools suitable for a problem can be 235x faster than tools unsuitable for the problem"

This is exactly the point he was making.

People have a desire to use the 'big' tools instead of trying to solve the real problem.

People both underestimate the power of their desktop machine and of 'old' tools, and overestimate the size of their task.

  • Occasionally designers seem to seek credit merely for possessing a new technology, rather than using it to make better designs. Computers and their affiliated apparatus can do powerful things graphically, in part by turning out the hundreds of plots necessary for good data analysis. But at least a few computer graphics only evoke the response "Isn’t it remarkable that the computer can be programmed to draw like that?" instead of "My, what interesting data".

    - Edward Tufte

    Applies to more than just design.

  • >People have a desire to use the 'big' tools

    Not only that, people seem to love claiming that they're "big data", perhaps because it makes them sound impressive and bigger than they are.

    Very few of us will ever do projects that justify using tools like Hadoop, and too few of us are willing to accept that our data fits in SQLite.

  • I love it when clients think they need a server workstation.

    Specs be damned!

    I need to start selling boxes.

Last paragraph of the article: "Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools"

That was exactly his point.
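To make "simple shell commands and tools" concrete, here is a minimal sketch of the kind of single-machine pipeline being discussed. The filename and column position are hypothetical placeholders, not taken from the article:

```sh
# Hypothetical: tally the values in column 3 of a ~3.5 GB tab-separated file.
# awk streams the file line by line, so the data never has to fit in RAM
# (and even if it did, 3.5 GB fits comfortably on a modest laptop).
awk -F '\t' '{ counts[$3]++ } END { for (v in counts) print v, counts[v] }' games.tsv
```

At these sizes a one-liner like this is usually disk-bound rather than CPU-bound, which is why a single box holds up so well.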

Not necessarily true. Depending on your use cases, it often still makes sense to use Hadoop. A really common scenario is that you'll implement your 3.5 GB job on one box, then you'll need to schedule it to run hourly. Then you'll need to schedule 3 or 4 such jobs to run hourly. Then your boss will want you to join it with some other dataset, and run that too with the other jobs. You'll eventually implement retries, timeouts, caching, partitioning, replication, resource management, etc.

You'll end up with a half-assed, homegrown Hadoop implementation, wishing you had just used Hadoop from the beginning.
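For what it's worth, the homegrown version usually starts as something like the sketch below: a cron schedule plus a retry-and-timeout wrapper around the one-box job. Script names, paths, and limits here are all hypothetical; each feature in the list above tends to get bolted onto this script one at a time.

```sh
# Hypothetical crontab entry: run the job at the top of every hour.
#   0 * * * * /opt/jobs/run_hourly.sh
#
# run_hourly.sh -- naive retry + timeout wrapper around the single-box job.
set -u
out=/data/out/$(date +%Y%m%d%H).tsv
for attempt in 1 2 3; do
  # GNU coreutils `timeout` kills the job if it runs longer than 30 minutes.
  if timeout 30m /opt/jobs/aggregate.sh /data/games.tsv > "$out"; then
    exit 0
  fi
  echo "attempt $attempt failed; retrying in 60s" >&2
  sleep 60
done
exit 1
```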