Comment by MrBuddyCasino
2 years ago
This will not stop BigCorp from spending weeks setting up a big-ass data analytics pipeline to process a few hundred MB from their "Data Lake" via Spark.
And this isn't even wrong, because what they need is a long-term maintainable method that scales up IF needed (rarely), is documented, and survives the loss of institutional knowledge three layoffs down the line.
Scaling _if_ needed has been the death knell of many companies. Every engineer wants to assume they will need to scale to millions of QPS; most of the time this is incorrect, and when it isn't, the requirements have changed and the system needs to be rebuilt anyway.
This is true for startups and small companies; Big Corp IT is so far from operating efficiently that this doesn't really matter.
I think it completely matters: yes, these orgs are a lot more wasteful, but there is still an opportunity to save money here, especially in this economy, if only for the internal politics wins.
I've spent time in some of the largest distributed computing deployments, and cost was a constant factor we had to account for. The easiest promos were always "I saved X hundred million" because it was hard to argue against saving money. And these happened way more than you would guess.
Long-term maintainability is an important point that most comments here ignore. If you only need to run the command once or twice every now and then in an ad hoc way, then sure, hack together a command-line script. But "email Jeff and ask him to run his script" isn't scalable if you need to run the command at a regular interval for years and years and have it work long after Jeff quits.
Sometimes the killer feature of that data analytics pipeline isn't scalability, but robustness, reproducibility, and consistency.
> "email Jeff and ask him to run his script" isn't scalable
Sure, it's not.
But building some monster cluster to process a few gigabytes isn't the only alternative.
You can write a good script (instead of hacking one together), put it in source control, pull it to the production server automatically, and run it regularly from cron. Now you have your robustness, reproducibility, and consistency, as well as much higher performance, for about one ten-thousandth of the cost.
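To make that concrete, here's a minimal sketch of what the whole "pipeline" can look like: one version-controlled Python script plus a cron entry. The paths, column names, and schedule are made-up examples, not a prescription.

    # report.py -- lives in source control, deployed to the production box.
    # Example crontab entry to run it nightly at 02:00:
    #   0 2 * * * /usr/bin/python3 /opt/reports/report.py >> /var/log/report.log 2>&1
    import csv
    from collections import defaultdict
    from pathlib import Path

    DATA = Path("/srv/data/events.csv")          # a few hundred MB of exported rows (hypothetical path)
    OUT = Path("/srv/reports/daily_totals.csv")  # where the nightly report lands

    def main() -> None:
        # Aggregate amount per customer -- stands in for whatever the Spark job was doing.
        totals = defaultdict(float)
        with DATA.open(newline="") as f:
            for row in csv.DictReader(f):
                totals[row["customer_id"]] += float(row["amount"])

        with OUT.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["customer_id", "total_amount"])
            for customer, amount in sorted(totals.items()):
                writer.writerow([customer, f"{amount:.2f}"])

    if __name__ == "__main__":
        main()

Deployment is just pulling the repo onto the box (or baking it into an image). The script is reviewed, versioned, and runs unattended, which covers most of the robustness/reproducibility argument without a cluster.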