Comment by ende

9 years ago

This is a very good point. I think many people are so caught up in bashing the hype train around big data and data science that they casually dismiss these incredibly valid points. It's not necessarily about how big your data is right now, but how big it will be in the very near future. Retooling later is often a lot harder than tooling properly up front, so even if some Spark operations seem to add unnecessary overhead right now, the lower transition cost down the road is often worth it.

I think the point is that if you really do have big data it makes sense, but many shops add huge cost and complexity to projects where simpler tools would be more than adequate.

  • It's the tooling, not the size of the data. Using "big data ecosystem" tools lets you use all kinds of useful things like Airflow for pipeline processing, Presto to query the data, Spark for enrichment and machine learning, and so on, all without moving the data. That greatly simplifies the metadata management you have to do if you're serious about things like data provenance and quality. (See the sketch after this list.)

  • That's definitely true too. Being able to accurately assess early on whether that need will ever exist is invaluable.
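
As a rough illustration of the "without moving the data" point, here is a minimal PySpark sketch that reads Parquet files in place on shared storage and aggregates them. The path and the event_type column are made-up examples for illustration, not anything from the thread.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session.
    spark = SparkSession.builder.appName("in-place-query").getOrCreate()

    # Read the files where they already live -- no copy into a separate warehouse.
    # The path and the "event_type" column are hypothetical.
    events = spark.read.parquet("hdfs:///data/events/")

    # A quick aggregation over the data in place.
    events.groupBy("event_type").count().orderBy("count", ascending=False).show(10)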

These are valid points, and I agree many underestimate the cost of retooling and infrastructure. However, I work on a team of smart engineers for whom shell scripting is new, let alone a full Hadoop/Spark setup and its associated tools. Luckily, you can often have your cake and eat it too: https://apidocs.joyent.com/manta/example-line-count-by-exten... It has been a super useful system so far, and my goal is to let our team learn some basic scripting techniques and then run them on our internal cloud using almost identical tooling. Plus, things like simple Python scripts are really easy to teach, and with this infrastructure they can scale quickly! (A rough sketch of the kind of script I mean is below.)
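
For what it's worth, here is roughly the kind of simple, teachable Python script I have in mind: a local line-count-by-extension, the same job as the Manta example linked above, written so the team can learn the logic before running it against the cloud. The stdin-driven usage shown in the comment is just one way to wire it up, not how Manta itself invokes jobs.

    #!/usr/bin/env python3
    # Count lines per file extension. Reads file paths from stdin, one per line,
    # e.g.:  find . -type f | python3 lines_by_ext.py
    import sys
    from collections import Counter
    from pathlib import Path

    counts = Counter()
    for raw in sys.stdin:
        path = Path(raw.strip())
        if not path.is_file():
            continue
        ext = path.suffix or "<none>"
        with open(path, errors="replace") as f:
            counts[ext] += sum(1 for _ in f)

    # Print the totals, largest first.
    for ext, n in counts.most_common():
        print(f"{n}\t{ext}")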