Comment by ende

9 years ago

This is a very good point. I think many people are so caught up in bashing the hype train around big data and data science that they casually dismiss these incredibly valid points. It's not necessarily about how big your data is right now, but how big it will be in the very near future. Retooling later is often a lot harder than tooling properly up front, so even if some Spark operations seem to add unnecessary overhead right now, the lower transition cost down the road is often worth it.

I think the point is that if you really do have big data it makes sense, but many shops add huge cost and complexity to projects where simpler tools would be more than adequate.

  • It's the tooling, not the size of the data. Using "big data ecosystem" tools lets you use all kinds of useful things like Airflow for pipeline processing, Presto to query the data, Spark for enrichment and machine learning, and so on, all without moving the data. That greatly simplifies the metadata management you have to do if you're serious about things like data provenance and quality. (See the sketch after this list.)

  • That's definitely true too. Being able to accurately assess early on whether that need will ever exist is invaluable.
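
As a rough illustration of the "without moving the data" point, here is a minimal PySpark sketch that reads Parquet files in place on shared storage and aggregates them. The path and the event_type column are made-up examples for illustration, not anything from the thread.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session.
    spark = SparkSession.builder.appName("in-place-query").getOrCreate()

    # Read the files where they already live -- no copy into a separate warehouse.
    # The path and the "event_type" column are hypothetical.
    events = spark.read.parquet("hdfs:///data/events/")

    # A quick aggregation over the data in place.
    events.groupBy("event_type").count().orderBy("count", ascending=False).show(10)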

These are valid points, and I agree many underestimate the cost of retooling and infrastructure. However, I work on a team of smart engineers for whom shell scripting is new, let alone a full Hadoop/Spark setup and its associated tools. Luckily, you can often have your cake and eat it too: https://apidocs.joyent.com/manta/example-line-count-by-exten... It has been a super useful system so far, and my goal is to let our team learn some basic scripting techniques and then run them on our internal cloud using almost identical tooling. Plus, things like simple Python scripts are really easy to teach, and with this infrastructure they can scale quickly! (A rough sketch of the kind of script I mean is below.)
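
For what it's worth, here is roughly the kind of simple, teachable Python script I have in mind: a local line-count-by-extension, the same job as the Manta example linked above, written so the team can learn the logic before running it against the cloud. The stdin-driven usage shown in the comment is just one way to wire it up, not how Manta itself invokes jobs.

    #!/usr/bin/env python3
    # Count lines per file extension. Reads file paths from stdin, one per line,
    # e.g.:  find . -type f | python3 lines_by_ext.py
    import sys
    from collections import Counter
    from pathlib import Path

    counts = Counter()
    for raw in sys.stdin:
        path = Path(raw.strip())
        if not path.is_file():
            continue
        ext = path.suffix or "<none>"
        with open(path, errors="replace") as f:
            counts[ext] += sum(1 for _ in f)

    # Print the totals, largest first.
    for ext, n in counts.most_common():
        print(f"{n}\t{ext}")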