Comment by hobos_delight

2 years ago

…yes - processing 3.2 GB of data will be quicker on a single machine. That is not the scale Hadoop or any other distributed compute platform is built for.

The reason we use these platforms is for data sets _larger_ than what a single machine can handle.

Most people who wasted $millions setting up Hadoop didn't have data sets larger than what could fit on a single machine.

  • I've worked at places where it would be 1000x harder to get a spare laptop from the IT closet to run some processing than to spend $50k-100k at Azure.

  • I completely agree. I love the tech and have spent a lot of time in it - but come on people, let’s use the right tool for the right job!

  • Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?

    I've heard this anecdote on HN before, but without ever seeing actual evidence that it happened, it reads like an old wives' tale and I'm not sure I believe it.

    I've worked on a Hadoop cluster, and setting one up and running it takes quite serious technical skill and experience; a team with those skills wouldn't be doing it unless they needed it.

    Can you really imagine some senior data and infrastructure engineers setting up 100 nodes knowing it was for 60 GB of data? Does that make any sense at all?

    • I did some data processing at Ubisoft.

      Each node in our Hadoop cluster had 64 GiB of RAM (which is about the max you should give a single-node Java application, with 32 GiB of that allocated to heap so it stays under the JVM's compressed-oops limit, FWIW). We had, I think, 6 of these nodes, for a total of 384 GiB of memory.

      Our storage was something like 18 TiB across all nodes.

      It would be a big machine, but our entire cluster could easily fit. The largest machine on the market right now is something like 128 CPUs and 20 TiB of memory.

      384 GiB was available in a single 1U rackmount server at least as early as 2014.

      Storage is basically unlimited with direct-attached-storage controllers and rackmount units.
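
      Back of the envelope (cluster figures from this comment; the single-machine spec is the rough market figure mentioned above, not a specific product):

      ```python
      # Sizing check: the 6-node Hadoop cluster described above vs. one big server.
      nodes = 6
      ram_per_node_gib = 64
      cluster_ram_gib = nodes * ram_per_node_gib      # 384 GiB total
      cluster_storage_tib = 18                        # fits behind one DAS controller

      big_machine_ram_gib = 20 * 1024                 # ~20 TiB in a single chassis
      big_machine_cpus = 128

      print(f"cluster RAM: {cluster_ram_gib} GiB")
      print(f"one machine: {big_machine_ram_gib} GiB "
            f"(~{big_machine_ram_gib // cluster_ram_gib}x the whole cluster)")
      ```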

    • > Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?

      I was a SQL Server DBA at Cox Automotive. Some director/VP caught the Hadoop around 2015 and hired a consultant to set us up. The consultant's brother worked at Yahoo and did foundational work with it.

      The consultant made us provision 6 nodes for Hadoop in Azure (our infra was on Azure Virtual Machines), each with 1 TB of storage. The entire SQL Server footprint was 3 nodes and maybe 100 GB at the time, and most of that was data bloat. He complained that the setup was too small.

      The data going into Hadoop was maybe 10 GB, and the consultant insisted we do a full load every 15 minutes "to keep it fresh". The delta for a 15-minute interval was less than 20 MB, maybe 50 MB during peak usage. Naturally his refresh script was pounding the primary server and hurting performance, so we spent additional money to set up a read replica for him to use.

      Did I mention the loading process took 16-17 minutes on average?
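
      For contrast, a minimal sketch of the incremental load that was never built, assuming a watermark column; the table, column, and helper names are invented for illustration, and `conn` stands in for whatever DB-API connection (e.g. pyodbc) you'd actually use:

      ```python
      from datetime import datetime, timezone

      def fetch_delta(conn, since: datetime, until: datetime) -> list:
          # Only the rows touched in the window: ~20-50 MB per 15 minutes,
          # instead of reloading the full ~10 GB every run.
          cur = conn.cursor()
          cur.execute(
              "SELECT * FROM dbo.Sales WHERE ModifiedAt > ? AND ModifiedAt <= ?",
              (since, until),
          )
          return cur.fetchall()

      def ship_downstream(rows: list) -> None:
          ...  # placeholder: write the delta into the dashboard's store

      def run_refresh(conn, last_watermark: datetime) -> datetime:
          now = datetime.now(timezone.utc)
          ship_downstream(fetch_delta(conn, last_watermark, now))
          return now  # persist this as the next run's watermark
      ```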

      You can quit reading now; this meets your request. But in case anyone wants a fuller story:

      Hadoop was used to feed some kind of bespoke dashboard product for a customer. Everyone at Cox was against using Microsoft's products for this, even though the entire stack was Azure/.Net/SQL Server... go figure. Apparently they weren't aware of PowerBI, or just didn't like it.

      I asked someone at MS (might have been one of the GuyInACube folks; I know I mentioned it to him) to come in and demo PowerBI, and in a 15-minute presentation he absolutely demolished everything they had been working on for a year. There was a new data group director who was pretty chagrined about it; I think they went into panic mode to ensure the customer didn't find out.

      The customer, surprisingly, wasn't happy with the progress or outcome of this dashboard, and was vocally pointing out data discrepancies compared to the production system, with some of the data days or even a week out of date.

      Once the original contract was up and it was time to renew, the Hadoop VP now had to pay for the project from his own budget, and about 60 days later it was mysteriously cancelled. The infra group was happy, as our Azure expenses suddenly halved and our database performance improved 20-25%.

      The customer seemed to be happy: they didn't have to struggle with the prototype anymore, and wow, where did all these SSRS reports that were perfectly fine come from? What do you mean they were there all along?

    • Developers are taught that you must scale horizontally. They become seniors and managers and ruin everything they touch.

      I have to teach developers that yes, we can have a 500 MB data cache in RAM, and that's actually not a lot at all.
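
      A minimal demonstration (plain Python, nothing else assumed): allocating 500 MB takes well under a second and barely registers on a modern machine.

      ```python
      import sys

      # Allocate ~500 MB in one go.
      cache = bytearray(500 * 1024 * 1024)

      print(f"{sys.getsizeof(cache) / 1024**2:.0f} MB held in a single object")
      # On a 16 GB laptop that's ~3% of RAM; on a 384 GiB server node, noise.
      ```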

    • I used to work for a pretty famous 2nd tier US company (smaller and less cool than FAANG).

      They had a team working on a Hadoop-based solution, and their biggest internal implementation was, in practice, about the size you're describing.

      It makes sense because internal politics.

    • I worked at a corp that had built a Hadoop cluster for lots of different heterogeneous datasets used by different teams. It was part of a strategy to get "all our data in one place". Individually, these datasets were small enough that they would have fitted perfectly fine on single (albeit beefy for the time) machines. Together, they arguably qualified as big data, and the justification for using Hadoop was that some analytics users occasionally wanted to run queries spanning all of these datasets. In practice, those queries were rare and not very high value, so the business would have been better off just not doing them and keeping the data on a bunch of siloed SQL Servers (or, better, putting some effort into tiering the rarely used data onto object storage).

    • I wonder if companies built Hadoop clusters for large jobs and then also used them for small ones.

      At work, they run big jobs on lots of data on big clusters. The processing pipeline also includes small jobs. It makes sense to write them in Spark and run them in the same way on the same cluster. The consistency is a big advantage and that cluster is going to be running anyway.
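
      A minimal sketch of what such a small job can look like in PySpark; the paths and column names are made up, but the point is that it uses the same tooling and deployment path as the big jobs:

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("small-daily-rollup").getOrCreate()

      # The input might be only a few MB, but the job is written, scheduled,
      # and monitored exactly like every other job on the cluster.
      events = spark.read.parquet("hdfs:///pipelines/events/2024-01-01/")
      rollup = events.groupBy("event_type").agg(F.count("*").alias("n"))
      rollup.write.mode("overwrite").parquet("hdfs:///pipelines/rollups/2024-01-01/")

      spark.stop()
      ```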

  • Moore's law and its analogues make this harder to back-predict than one might think, though. A decade ago, computers had only about an eighth (as a rough upper bound) of the resources modern machines tend to have at similar price points.

This is exactly the point of the article. From the conclusion:

> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.

What can be done on a single machine grows with time, though. You can have terabytes of RAM and petabytes of flash in a single machine now.

This will not stop BigCorp from spending weeks setting up a big-ass data analytics pipeline to process a few hundred MB from their "Data Lake" via Spark.

And this isn't even wrong, because what they need is a long-term maintainable method that scales up IF needed (rarely), is documented, and survives the loss of institutional knowledge three layoffs down the line.

  • Scaling _if_ needed has been the death knell of many companies. Every engineer wants to assume they will need to scale to millions of QPS; most of the time this is incorrect, and when it is not, the requirements have changed and it needs to be rebuilt anyway.

  • The long-term maintainability is an important point that most comments here ignore. If you need to run the command once or twice every now and then in an ad hoc way, then sure, hack together a command-line script. But "email Jeff and ask him to run his script" isn't scalable if you need to run the command at a regular interval for years and years and have it work long after Jeff quits.

    Sometimes the killer feature of that data analytics pipeline isn't scalability, but robustness, reproducibility, and consistency.

    • > "email Jeff and ask him to run his script" isn't scalable

      Sure, it's not.

      But building some monster cluster to process a few gigabytes isn't the only alternative.

      You can write a good script (instead of hacking one together), put it in source control, pull it from there automatically to the production server, and run it regularly from cron. Now you have your robustness, reproducibility, and consistency, as well as much higher performance, at about one-ten-thousandth of the cost.
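
      A minimal sketch of that setup (paths and names are illustrative): the script lives in git, gets deployed to the server, and a crontab entry runs it on schedule.

      ```python
      #!/usr/bin/env python3
      # Deployed from source control to /opt/jobs/refresh.py and scheduled with:
      #   */15 * * * * /opt/jobs/refresh.py >> /var/log/refresh.log 2>&1

      import logging
      import sys

      logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

      def main() -> int:
          logging.info("refresh started")
          # ... the few-GB processing job a single machine handles easily ...
          logging.info("refresh finished")
          return 0

      if __name__ == "__main__":
          sys.exit(main())
      ```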