Comment by hliyan
2 years ago
I've written this comment before: in 2007, there was a period where I used to run an entire day's worth of trade reconciliations for one of the US's primary stock exchanges on my laptop (I was the on-site engineer). It was a Perl script, and it completed in minutes. A decade later, I watched incredulously as a team tried to spin up a Hadoop cluster (or Spark -- I forget which) over several days, to run a workload an order of magnitude smaller.
> over several days, to run a work load an order of magnitude smaller
Here I sit, running a query on a fancy cloud-based tool we pay nontrivial amounts of money for, which takes ~15 minutes.
If I download the data set to a Linux box I can do the query in 3 seconds with grep and awk.
Oh but that is not The Way. So here I sit waiting ~15 minutes every time I need to fine tune and test the query.
Also, of course, the query is now written in the vendor's homegrown query language, which is missing a lot of functionality, so whenever I need to do a different transformation or pull the data apart a bit differently, I get to file a feature request and wait a few months for it to be implemented. On the Linux box I could just change my awk parameters a little (or throw perl in the pipeline for heavier lifting) and be done in a minute. But hey, at least I can put the ticket in a blocked state for a few months while waiting for the vendor.
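For the curious, the grep/awk version looks something like this (the file name and column layout here are made up, just to sketch the shape of it; adjust to taste):

    # hypothetical trades.csv: timestamp,symbol,side,qty,price
    grep ',AAPL,' trades.csv \
      | awk -F, '{ qty[$3] += $4; notional[$3] += $4 * $5 }
                 END { for (s in qty) print s, qty[s], notional[s] }'

Filter with grep, split on commas with awk, sum quantity and notional by side. When field splitting isn't enough, swap the awk body or pipe through perl -ne for the heavier lifting.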
Why are we doing this?
>Why are we doing this?
someone got promoted
Oh how true this is. At my current work we use _Kubernetes_ for absolutely no reason at all, other than that the guy in charge of infra wanted to learn it.
The result?
1. I don't have access to basic logs for debugging, because apparently the infra guy would have to give me access to the whole cluster.
2. Production ends up dying from time to time, because apparently they don't know how to set it up.
3. The boss likes him more, because he's using big-boy tools.
yeah but who was getting better stuff on their resume? didn't you get the memo about perl?
Just because your throw-away 40-line script worked from cron for five years without issue doesn't mean that a seven-node Hadoop cluster didn't come with benefits. You got to write in a language called "Pig"! So fun.
I still think it'd be easier to maintain the script that runs on a single computer than to maintain a Hadoop cluster.
The resume would look better if you used Python and Polars ;-)
s/he was obviously joking
maybe we should all start to add "evaluated a Hadoop cluster for X applications and saved the company 1M a year (in time, headcount, and uptime) by going with a 40-line Perl script"
I like this idea. And something similar for evaluating blockchains and sticking with a relational database instead.
> yeah but who was getting better stuff on their resume? didn't you get the memo about perl?
That is why Rust is so awesome. It still lets me get stuff on my resume, while still producing an executable that runs on my laptop with high performance.
I'd love to hear what the benefits are of using a framework for the wrong purpose.
Resume entries!
There was a time, about 10 years ago, when Hadoop/Spark was on just about every back-end job post out there.
I was in the field at the time and I agree. I thought it had to be what the big boys used. Then I realized that my job involves huge amounts of structured data and our MySQL instance handled everything quite well.
People should first try the simplest most obvious solution just to have a baseline before they jump into the fancy solutions.
I imagine your laptop had an SSD.
People who weren't developing around this time can't appreciate how game-changing SSDs were compared to spinning rust.
I/O was no longer the bottleneck once SSDs arrived.
Even today, people way underestimate the power of NVMe.