Comment by mej10

9 years ago

> 1.75GB

> 3.46GB

These will fit in memory on modest hardware. No reason to use Hadoop.

The title could be: "Using tools suitable for a problem can be 235x faster than tools unsuitable for the problem"

This is exactly the point he was making.

People have a desire to use the 'big' tools instead of trying to solve the real problem.

People both underestimate the power of their desktop machine and of 'old' tools, and overestimate the size of their task.

  • Occasionally designers seem to seek credit merely for possessing a new technology, rather than using it to make better designs. Computers and their affiliated apparatus can do powerful things graphically, in part by turning out the hundreds of plots necessary for good data analysis. But at least a few computer graphics only evoke the response "Isn’t it remarkable that the computer can be programmed to draw like that?" instead of "My, what interesting data".

    - Edward Tufte

    Applies to more than just design.

  • >People have a desire to use the 'big' tools

    Not only that, people seem to love claiming that they're "big data", perhaps because it makes them sound impressive and bigger than they are.

    Very few of us will ever do projects that justify using tools like Hadoop, and too few of us are willing to accept that our data fits in SQLite.

  • I love it when clients think they need a server workstation.

    Specs be damned!

    I need to start selling boxes.

Last paragraph of the article: "Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools"

That was exactly his point.
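To make "simple shell commands and tools" concrete, here is a minimal sketch of the kind of single-machine pipeline being discussed. The filename and column position are hypothetical placeholders, not taken from the article:

```sh
# Hypothetical: tally the values in column 3 of a ~3.5 GB tab-separated file.
# awk streams the file line by line, so the data never has to fit in RAM
# (and even if it did, 3.5 GB fits comfortably on a modest laptop).
awk -F '\t' '{ counts[$3]++ } END { for (v in counts) print v, counts[v] }' games.tsv
```

At these sizes a one-liner like this is usually disk-bound rather than CPU-bound, which is why a single box holds up so well.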

Not necessarily true. Depending on your use cases, it often still makes sense to use Hadoop. A really common scenario is that you'll implement your 3.5 GB job on one box, then you'll need to schedule it to run hourly. Then you'll need to schedule 3 or 4 such jobs to run hourly. Then your boss will want you to join it with some other dataset, and run that too with the other jobs. You'll eventually implement retries, timeouts, caching, partitioning, replication, resource management, etc.

You'll end up with a half-assed, homegrown Hadoop implementation, wishing you had just used Hadoop from the beginning.
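For what it's worth, the homegrown version usually starts as something like the sketch below: a cron schedule plus a retry-and-timeout wrapper around the one-box job. Script names, paths, and limits here are all hypothetical; each feature in the list above tends to get bolted onto this script one at a time.

```sh
# Hypothetical crontab entry: run the job at the top of every hour.
#   0 * * * * /opt/jobs/run_hourly.sh
#
# run_hourly.sh -- naive retry + timeout wrapper around the single-box job.
set -u
out=/data/out/$(date +%Y%m%d%H).tsv
for attempt in 1 2 3; do
  # GNU coreutils `timeout` kills the job if it runs longer than 30 minutes.
  if timeout 30m /opt/jobs/aggregate.sh /data/games.tsv > "$out"; then
    exit 0
  fi
  echo "attempt $attempt failed; retrying in 60s" >&2
  sleep 60
done
exit 1
```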