I've written this comment before: in 2007, there was a period where I used to run an entire day's worth of trade reconciliations for one of the US's primary stock exchanges on my laptop (I was the on-site engineer). It was a Perl script, and it completed in minutes. A decade later, I watched incredulously as a team tried to spin up a Hadoop cluster (or Spark -- I forget which) over several days, to run a workload an order of magnitude smaller.
> over several days, to run a workload an order of magnitude smaller
Here I sit, running a query on a fancy cloud-based tool we pay nontrivial amounts of money for, which takes ~15 minutes.
If I download the data set to a Linux box I can do the query in 3 seconds with grep and awk.
Oh, but that is not The Way. So here I sit, waiting ~15 minutes every time I need to fine-tune and test the query.
Also, of course, the query now is written in the vendor's homegrown weird query language, which is lacking a lot of functionality, so whenever I need to do some different transformation or pull apart the data a bit differently, I get to file a feature request and wait a few months for it to be implemented. On the Linux box I could just change my awk parameters a little bit (or throw Perl into the pipeline for heavier lifting) and be done in a minute. But hey, at least I can put the ticket in a blocked state for a few months while waiting for the vendor.
Why are we doing this?
>Why are we doing this?
someone got promoted
1 reply →
yeah but who was getting better stuff on their resume? didn't you get the memo about perl?
Just because your throw-away 40-line script worked from cron for five years without issue doesn't mean that a seven-node Hadoop cluster didn't come with benefits. You got to write in a language called "Pig"! So fun.
I still think that it'd be easier to maintain the script that runs on a single computer than to maintain a Hadoop cluster.
4 replies →
maybe we should all start to add "evaluated a Hadoop cluster for X applications and saved the company $1M a year (in time, headcount, and uptime) by going with a 40-line Perl script"
1 reply →
> yeah but who was getting better stuff on their resume? didn't you get the memo about perl?
That is why Rust is so awesome. It still lets me get stuff on my resume, while still producing an executable that runs on my laptop with high performance.
I'd love to hear what the benefits are of using a framework for the wrong purpose.
3 replies →
There was a time, about 10 years ago, when Hadoop/Spark was on just about every back-end job post out there.
1 reply →
People should first try the simplest most obvious solution just to have a baseline before they jump into the fancy solutions.
I imagine your laptop had an SSD.
People who weren't developing around this time can't appreciate how game-changing SSDs were compared to spinning rust.
I/O was no longer the bottleneck post-SSD.
Even today, people way underestimate the power of NVMe.
This was nicely foreseen in the original MapReduce paper, where the authors write:
> The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues. As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.
If you are not meeting this complexity (and today, with 16 TB of RAM and 192 cores, many jobs don't), then MapReduce/Hadoop is not for you...
There is an incentive for people to go horizontal rather than permitting themselves to go vertical.
Makes sense: we are told in university that vertical has limits and that we should prioritise horizontal; but I feel a little like the "mid-wit" meme, because once we realise how vertical we can go, we can end up using significantly fewer resources in aggregate (as there is overhead in distributed systems, of course).
I also think we are disincentivised from going vertical because most cloud providers prioritise splitting workloads; most people don't have 16 TiB of RAM available to them, but they might have a credit card on file for a cloud provider/hyperscaler.
*EDIT*: The largest AWS instance is, I think, the x2iedn.metal with 128 vCPUs and 4 TiB of RAM.
*EDIT2*: u-24tb1.metal seems larger, with 448 vCPUs and 24 TiB of memory, but I'm not sure you can actually use it for anything that's not SAP HANA.
Horizontal scaling did have specific incentives when Map Reduce got going and today also in the right parameter space.
For example, I think Dean & Ghemawat reasonably describe their incentives: saving capital by reusing an already distributed set of machines while conserving network bandwidth. In Table 1 they write that the average job duration was around 10 minutes, involving around 150 computers, and that on average 1.2 workers died per such job!
The computers had 2-4 GiB of memory, 100-megabit Ethernet and IDE HDDs. In 2003, when they got MapReduce going, Google's total R&D budget was $90 million. There was no cloud, so if you wanted a large machine you had to pay up front.
What they did with Map Reduce is a great achievement.
But I would advise against scaling horizontally right from the start just because we may need to scale horizontally at some point in the future. If it will fit on one machine, do it on one.
1 reply →
MapReduce came along at a moment in time where going horizontal was -essential-. Storage had kept increasing faster than CPU and memory, and CPUs in the aughts encountered two significant hitches: the 32-bit to 64-bit transition and the multicore transition. As always, software lagged these hardware transitions; you could put 8 or 16GB of RAM in a server, but good luck getting Java to use it. So there was a period of several years where the ceiling on vertical scalability was both quite low and absurdly expensive. Meanwhile, hard drives and the internet got big.
3 replies →
The problem is that you do want some horizontal scaling regardless, just to avoid SPOFs as much as you can.
15 replies →
Plus horizontal scaling is sexier
2 replies →
One of my favorite posts. I'll always upvote this. Of course there are use cases one or two standard deviations outside the mean that require truly massive distributed architectures, but not your shitty csv / json files.
Reflecting on a decade in the industry I can say cut, sort, uniq, xargs, sed, etc etc have taken me farther than any programming language or ec2 instance.
Related:
https://news.ycombinator.com/item?id=12472905 - 7 years ago (171 comments)
I think you have an off by one year error in these.
My work sent me to a Hadoop workshop in 2016 where in the introduction the instructor said Hadoop would replace the traditional RDBMS within five years. We went on to build a system to search the full text of Shakespeare for word instances that took a solid minute to scan maybe 100k of text. An RDBMS with decent indexes could have done that work instantly; hell, awk | grep | sort | uniq -c could have done that work instantly.
It’s been 8 years and I think RDBMS is stronger than ever?
It colored the entire course with a "yeah, right". Frankly, is Hadoop still popular? Sure, it's still around, but I don't hear much about it anymore. I never ended up using it professionally; I do most of my heavy data processing in Go and it works great.
https://twitter.com/donatj/status/740210538320273408
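For scale: the complete works of Shakespeare are only a few megabytes of plain text, so even a naive single-process word count finishes in well under a second. A rough sketch in Python (the input is just whatever you pipe in; nothing here is specific to the workshop's setup):

    import re
    import sys
    from collections import Counter

    # Usage: python wordcount.py < shakespeare.txt
    # Counts word frequencies across the whole input in one pass.
    words = re.findall(r"[a-z']+", sys.stdin.read().lower())
    counts = Counter(words)
    for word, n in counts.most_common(20):
        print(f"{n:8d}  {word}")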
Hadoop has largely been replaced by Spark which eliminates a lot of the inefficiencies from Hadoop. HDFS is still reasonably popular, but in your use case, running locally would still be much better.
Spark is still pretty non-performant.
If the workload fits in memory on a single machine, DuckDB is so much more lightweight and faster.
4 replies →
In terms of raw performance? Sure. In terms of the overhead, the mental-model shift, the library changes, the version churn and problems with Scala/Spark libraries, the black-box debugging? No, still really inefficient.
Most of the companies I have worked with that actively have spark deployed are using it on queries with less than 1TB of data at a time and boy howdy does it make no sense.
1 reply →
These posts always remind me of the [Manta Object Storage](https://www.tritondatacenter.com/triton/object-storage) project by Joyent. It was basically object storage combined with the ability to run arbitrary programs against your data in situ. The primary, and key, difference was that you kept the data in place and distributed the program to the data storage nodes (the opposite of most data processing, as I understand it). I think of it as a superpowered version of using [pssh](https://linux.die.net/man/1/pssh) to grep logs across a datacenter. Yet another idea before its time. Luckily, Joyent [open sourced](https://github.com/TritonDataCenter/manta) the work, but the fact that it still hasn't caught on as "The Way" is telling.
Some of the projects I remember from the Joyent team: dumping recordings of local Mario Kart games to Manta and running analytics on the raw video to generate office kart-racer stats; the bog-standard dump-all-the-logs-and-map/reduce/grep/count-them; and, I think, one about running mdb postmortems on terabytes of core dumps.
On a similar reasoning, in 2008 or such, I observed that, while our Java app would be able to run more user requests per second than our Python version, it’d take months for the Java app to overtake the Python one in total requests served because it’d have to account for a 6 month head start.
Far too often we waste time optimising for problems we don’t have, and, most likely, will never have.
…yes - processing 3.2G of data will be quicker on a single machine. This is not the scale of Hadoop or any other distributed compute platform.
The reason we use these is for when we have a data set _larger_ than what can be done on a single machine.
Most people who wasted $millions setting up Hadoop didn’t have data sets larger than could fit on a single machine.
I've worked places where it would be 1000x harder getting a spare laptop from the IT closet to run some processing than it would be to spend $50k-100k at Azure.
I completely agree. I love the tech and have spent a lot of time in it - but come on people, let’s use the right tool for the right job!
Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?
I've heard this anecdote on HN before but without ever seeing actual evidence it happened; it reads like an old wives' tale and I'm not sure I believe it.
I’ve worked on a Hadoop cluster and setting it up and running it takes quite serious technical skills and experience and those same technical skills and experience would mean the team wouldn’t be doing it unless they needed it.
Can you really imagine some senior data and infrastructure engineers setting up 100 nodes knowing it was for 60GB of data? Does that make any sense at all?
9 replies →
Moore's law and its analogues make this harder to back-predict than one might think, though. A decade ago, computers had only about an eighth (rough upper bound) of the resources modern machines tend to have at similar price points.
This is exactly the point of the article. From the conclusion:
> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.
What can be done on a single machine grows with time though. You can have terabytes of ram and petabytes of flash in a single machine now.
This will not stop BigCorp from spending weeks setting up a big-ass data analytics pipeline to process a few hundred MB from their "Data Lake" via Spark.
And this isn't even wrong, because what they need is a long-term maintainable method that scales up IF needed (rarely), is documented, and survives the loss of institutional knowledge three layoffs down the line.
Scaling _if_ needed has been the death knell of many companies. Every engineer wants to assume that they will need to scale to millions of QPS; most of the time this is incorrect, and when it is not, the requirements have changed and it needs to be rebuilt anyway.
5 replies →
The long term maintainability is an important point that most comments here ignore. If you need to run the command once or twice every now and then in an ad hoc way then sure hack together a command line script. But "email Jeff and ask him to run his script" isn't scalable if you need to run the command at a regular interval for years and years and have it work long after Jeff quits.
Sometimes the killer feature of that data analytics pipeline isn't scalability, but robustness, reproducibility and consistency.
1 reply →
I bought a Raspberry Pi 4 for Christmas. It's connected to my dev laptop directly via wired connection. My self imposed challenge for this year is to try to offload as much work to this little Pi as I can. So I'm a fan of this approach.
Even with large-scale data, Hadoop/Spark tend to be used in ways that make no sense, as if something being self-described as big data means that as soon as you cross some threshold you SHOULD be using them.
I recently had an argument with a senior engineer on our team because a pipeline that processed several PB of data, scaled to 1000+ machines, and was by all accounts a success was just a Python script using multiprocessing, distributed with ECS, and didn't use Spark.
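As a rough sketch of that kind of setup (the environment variable names TASK_INDEX, TASK_COUNT and SHARD_GLOB are hypothetical, and the per-shard work here is just a line count standing in for the real processing): each container takes a slice of the input files and fans out over local cores with multiprocessing.

    import glob
    import multiprocessing
    import os

    def process_shard(path):
        # Stream one shard and return a simple aggregate (here: a line count).
        n = 0
        with open(path, "rb") as f:
            for _ in f:
                n += 1
        return n

    def main():
        # Each task learns its slice of the input from environment variables
        # (hypothetical names): this is task TASK_INDEX of TASK_COUNT.
        paths = sorted(glob.glob(os.environ.get("SHARD_GLOB", "data/*.csv")))
        index = int(os.environ.get("TASK_INDEX", "0"))
        count = int(os.environ.get("TASK_COUNT", "1"))
        mine = paths[index::count]
        with multiprocessing.Pool() as pool:
            totals = pool.map(process_shard, mine)
        print(f"task {index}: {sum(totals)} lines across {len(mine)} shards")

    if __name__ == "__main__":
        main()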
Common command line tools are often the best for analyzing and understanding HPC clusters and their issues. People have often asked me for tools and web pages to help them understand and diagnose issues in our cluster, or asked if we could use something like Hadoop, Spark, or some Azure/GCP/AWS tool to do it faster. I've said that if they want to spend the effort to use those tools, it could be valuable; but if it takes me 10 minutes to use those tools and <1 minute using command line tools, I'll always fall back to the command line.
That's not to say that fancy tools don't have their use; but people often forget how much you can do with a few simple commands if you understand a pipeline and how the commands work.
Current-day version: if you've got fewer than 10K-100K vectors, use numpy.dot instead of a vector store database.
Even if you have a lot more than that you could easily use SQLite with FAISS. It works great.
Bingo. This combination is underappreciated.
1 reply →
You can easily add two zeros to that.
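A minimal sketch of the brute-force numpy.dot approach suggested a few comments up (the corpus size, dimensionality and query are made up for illustration):

    import numpy as np

    # Pretend corpus: 50k embeddings of dimension 384, L2-normalised so that
    # cosine similarity is just a dot product.
    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(50_000, 384)).astype(np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    def top_k(query, k=10):
        q = query / np.linalg.norm(query)
        scores = vectors.dot(q)                 # one matrix-vector product
        idx = np.argpartition(scores, -k)[-k:]  # avoids a full sort
        return idx[np.argsort(scores[idx])[::-1]]

    print(top_k(rng.normal(size=384).astype(np.float32)))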
There was a researcher in an environmental field (ice surface movements at the South Pole, I think) who rewrote his calculations from Nvidia GPU computing to a plain C file. The GPU version had run for months; the rewrite took seconds.
This is one of my favourite posts. Part of my PhD was based on this post. https://discovery.ucl.ac.uk/id/eprint/10085826/ (Section 4.1). I presented this section of my research in a conference and won best paper award as well.
Another post I love is https://stackoverflow.com/questions/2908822/speed-up-the-loo... where the guy manages to speed up a function which will run for days to milliseconds.
If we write dedicated tools, the speed boost can be enormous. We can process 1 billion rows from a simple CSV in just 2 seconds. In slow Java. It just requires some skills, which are hard to find nowadays.
https://github.com/gunnarmorling/1brc
Like a lot of things, people tend to make a decision for horizontal vs vertical and then stick with it even as the platforms or "physics" change underneath them over time. Same for memory bandwidth (which people, like Sun, thought would remain more of a bottleneck than it actually turned out to be).
The rise of single-node computing is powered by two things:
- desktop computers are really powerful (Apple Mx, AMD Epyc etc.)
- software like Polars
What is the largest data set people here are processing daily for ETL on one machine? What tools are you using, and what does the job do? I want to know how capable new libraries like polars are, and how far you can delay transitioning to Spark. Are terabyte datasets feasible yet?
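To give a feel for it, here's a minimal, hypothetical Polars sketch (the file name and columns are invented); the lazy API plus the streaming engine processes the input in chunks, so it doesn't have to fit in RAM:

    import polars as pl

    # Nothing is read until collect(); the query plan is optimised first.
    lazy = (
        pl.scan_csv("events.csv")          # hypothetical input file
          .filter(pl.col("status") == "ok")
          .group_by("user_id")
          .agg(pl.col("amount").sum().alias("total"))
          .sort("total", descending=True)
    )

    # streaming=True asks Polars to execute the plan in chunks rather than
    # materialising the whole dataset in memory at once.
    df = lazy.collect(streaming=True)
    print(df.head())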
I remember reading this article several years ago. Good to see it again. I remember when everyone thought their data was big data. How provincial we were.
It's worth noting that NVMe and SSDs make this possible. If this were off an HDD, this approach would likely be slower.
272MB/s is "consumer spinning rust RAID 5 via USB3" speeds. I know this because that's roughly what my NAS backup machine does on writes/reads. From a NAS on gbit it's overpowered by 150%, for sure.
That also means using Hadoop makes sense only when cluster size > 235 machines.
More succinct version of the same from Gary Bernhardt of WAT fame (from 2015, same era) https://twitter.com/garybernhardt/status/600783770925420546
> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.
Actually the awk solution in the blog post doesn’t load the entire dataset into memory. It is not limited by RAM. Even if you make the input 100x larger, mawk will still be hundreds of times faster than Hadoop. An important lesson here is streaming. In our field, we often process >100GB data in <1GB memory this way.
This. For many analytical use cases the whole dataset doesn't have to fit into memory.
Still: of course worthwhile to point out how oversized a compute cluster approach is when the whole dataset would actually fit into memory of a single machine.
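A rough Python equivalent of that streaming pattern (the tab-separated key/value layout is invented): memory use is bounded by the number of distinct keys, not by the size of the input.

    import sys
    from collections import defaultdict

    totals = defaultdict(int)

    # Read line by line from stdin, e.g.  cat data.tsv | python agg.py
    # Only the per-key totals live in memory, so a 100 GB input can be
    # aggregated in well under 1 GB of RAM if the key space is small.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        key, value = fields[0], fields[1]
        totals[key] += int(value)

    for key, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{key}\t{total}")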
Reminds me of one of my favourite twitter posts:
> Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM.
https://twitter.com/DEVOPS_BORAT/status/299176203691098112
DEVOPS_BORAT contains a lot of truth if you think about it, hah.
Our sarcastic team-motto is very much this: https://twitter.com/DEVOPS_BORAT/status/41587168870797312
https://yourdatafitsinram.net/
I see servers with even more RAM (32 TB[1], 48 TB[2], probably more) at: https://buy.hpe.com/us/en/compute/mission-critical-x86-serve...
[1]https://buy.hpe.com/us/en/compute/mission-critical-x86-serve...
[2]https://buy.hpe.com/us/en/compute/mission-critical-x86-serve...
1 reply →
It's so funny how there is almost no 'science' - or 'engineering' - in modern 'computer science' or 'software engineering'. The finding on OP's website, of what is or isn't fast for what purpose, should not be surprising us, 79 years after the first programmable computer. Yet we go about our work, blissfully ignorant of the actual capabilities of what we're doing.
We don't have hypotheses, experiments, and results published, of what a given computing system X, made up of Y, does or doesn't achieve. There are certainly research papers, algorithms and proof-of-concepts, but (afaict) no scientific evidence for most of the practices we follow and results we get.
We don't have engineering specifications or tolerances for what a given thing can do. We don't have calculations for how to estimate, given X computing power, and Y system or algorithm, how much Z work it can do. We don't even have institutional knowledge of all the problems a given engineering effort faces, and how to avoid those problems. When we do have institutional knowledge, it's in books from 4 decades ago, that nobody reads, and everyone makes the same mistakes again and again, because there is no institutional way to hold people to account to avoid these problems.
What we do have is some tool someone made, that millions of dollars are then poured into using, without any realistic idea whatsoever of what the result is going to be. We hope that we get what we want out of it once we're done building something with it. Like building a bridge over a river and hoping it can handle the traffic.
There are two reasons creating software will never (in my lifetime) be considered an engineering discipline:
1) There are (practically) no consequences for bad software.
2) The rate of change is too high to introduce true software development standards.
Modern engineering best practice is "follow the standards". The standards were developed in blood -- people were either injured or killed, so the standard was developed to make sure it didn't happen again. In today's society, no software defects (except maybe aircraft and medical devices) are considered severe enough for anyone to call for the creation and enforcement of standards. Even Teslas full-self-driving themselves into parked fire trucks and killing the occupants doesn't seem enough.
Engineers who design buildings and bridges also have an advantage not available to programmers: physics doesn't change, at least not at scales and rates that matter. When you have a stable foundation, it is far easier to develop engineering standards on that foundation. Programmers have no such luxury. Computers have only been around for less than 100 years, and the rate of change is so high in terms of architecture and capabilities that we are constantly having to learn "new physics" every few years.
Even when we do standardize (e.g. x86 ISA) there is always something bubbling in research labs or locked behind NDAs that is ready to overthrow that standard and force a generation of programmers into obsolescence so quickly there is no opportunity to realistically convey a "software engineering culture" from one generation to the next.
I look forward to the day when the churn slows down enough that a true engineering culture can develop.
I want to echo the 'new physics' idea.
Imagine what scenario we would be in if they laid down the Standards of Software Engineering (tm) 20 years ago. Most of us would likely be chafing against guidelines that make our lives much worse for negative benefit.
In 20 years we'll have a much better idea of how to write good software under economic constraints. Many things we try to nail down today will only get in the way of future advancements.
My hope is that we're starting to get close though. After all, 'general purpose' languages seem to be converging on ML* style features.
* - think standard ML not machine learning. Static types, limited inference, algebraic data types, pattern matching, no null, lambdas, etc.
1 reply →
"It's so funny how there is almost no 'science' - or 'engineering' - in modern 'computer science' or 'software engineering'"
It may not have been clear in 2014, but it is now: data scientists are not computer scientists or software engineers. So tarring software engineers with data scientists' practices is really a low blow. Not that we're perfect by any means, but the data point you're drawing a line through isn't even on the graph you're trying to draw.
I was unlucky enough to brush that world about a year ago. I am grateful I bounced off of it. It was surreal how much infrastructure data science has put into place just to deal with their mistake of choosing Python as their fundamental language. They're so excited about the frameworks being developed over years to do streaming of things that a "real" compiled language can either easily do on a single node, or could easily stream. They simply couldn't process the idea that I was not excited about porting all my code to their streaming platform because my code was already better than that platform. A constant battle with them assuming I just must not Get It and must just not understand how awesome their new platforms were, and me trying to explain how much of a downgrade it was for me.
"We don't have calculations for how to estimate, given X computing power, and Y system or algorithm, how much Z work it can do."
Yeah, we do, actually. I use this sort of stuff all the time. Anyone who works competently at scale does, it's a basic necessity for such things. Part of the mismatch I had with the data scientists was precisely that I had this information and not only did they not, they couldn't even process that it does exist and basically seemed to assume I must just be lying about my code's performance. It just won't take the form you expect. It's not textbooks. It can't be textbooks. But that's not the criterion of whether such data exists.
We do actually have some methods of calculating expected performance. For instance, we know that a Zen 4 CPU can do four 256-bit operations per clock, with some restrictions on what combinations are allowed. We are never going to hit 4 outright in real code, but 3.5 is a realistic target for well-optimised code. We can use 1 instruction to detect newline characters within those 32 bytes, then a few more to find the exact location, then a couple to determine if the line is a result, and a few more to extract that result. Given a high density of newlines, this will mean something on the order of 10 instructions per 32-byte block searched. Multiply the numbers and we expect to process approximately 11 bytes per clock cycle. On a 5 GHz CPU that would mean we would expect to be done in 32 ms, give or take. And the data would of course need to be in memory already for this time to be feasible, as loading it from disk takes appreciably longer.
Of course you have to spend some effort to actually get code this fast, and that probably isn't worth it for the one-shot job. But jobs like compression, video codecs, cryptography and that newfangled AI stuff all have experts that write code in this manner, for generally good reasons, and they can all ballpark how a job like this can be solved in a close to optimal fashion.
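Writing that arithmetic out (the ~1.75 GB input size is simply the value that makes the quoted 32 ms come out; everything else is from the comment above):

    # Back-of-envelope scan-throughput estimate.
    ipc = 3.5               # sustained SIMD instructions per clock
    insns_per_block = 10    # instructions to handle one 32-byte block
    block_bytes = 32
    clock_hz = 5e9          # 5 GHz

    bytes_per_cycle = block_bytes / (insns_per_block / ipc)   # ~11.2 B/cycle
    bytes_per_second = bytes_per_cycle * clock_hz             # ~56 GB/s

    data_bytes = 1.75e9     # assumed input size
    print(f"{bytes_per_cycle:.1f} B/cycle, {bytes_per_second / 1e9:.0f} GB/s")
    print(f"estimated scan time: {data_bytes / bytes_per_second * 1e3:.0f} ms")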
The way I see it is that we're in an era analogous to what came immediately after alchemy. We're all busy building up phlogiston-like theories that will disprove themselves in a decade or two.
But this is better than where we just came from. Not that long ago, you would build software by getting a bunch of wizards together in a basement and hope they produce something that you can sell.
If things feel worse (I hope) that's because the rest of us muggles aren't as good as the wizards that came before us. But at least we're working in a somewhat tractable fashion.
The mathematical frameworks for construction were first laid out ~1500s (iirc). And people had been doing it since time immemorial. The mathematics for computation started about 1920-30s. And there's currently no mathematics for the comprehensibility of blocks of code. [Sure there's cyclomatic complexity and Weyuker's 9 properties, but I've got zero confidence in either of them. For example, neither of them account for variable names, so a program with well named variables is just as 'comprehensible' as a program with names composed of 500MB of random characters. Similarly, some studies indicate that CC has worse predictive power of the presence of defects than lines of code. And from what I've seen in Weyuker, they haven't shown that there's any reason to assume that their output is predictive of anything useful.]
Isn't a program really a scientific experiment? About as pure as you can get. It's not formalized as such but that's trivial.
It can be, but usually isn't. Similarly, dropping a feather and a bowling ball simultaneously might be science, or might not be. Did I make observations? Or am I just delivering some things to my friend at the bottom?
I, for one, do not miss having "big data" mentioned in every meeting, talk, memo, etc. Sure it's AI now, but even that doesn't become as annoying and misunderstood as the big data fad was.
It's all a cycle. Remember when XML was the future?
Money quote from https://www.bitecode.dev/p/hype-cycles:
> geeks think they are rational beings, while they are completely influenced by buzz, marketing, and their emotions. Even more so than the average person, because they believe they are less susceptible to it than normies, so they have a blind spot.
Here I am seated in meetings discussing MACH architectures and serverless, while thinking about "The Network is the Computer" in Sun's manuals.
3 replies →
This made me remember the amazing “parable of the languages” that had XML as the main antagonist of the story. We need an AI update for this one.
https://burningbird.net/the-parable-of-the-languages/
> but even that doesn't become as annoying and misunderstood as the big data fad was.
Must be nostalgia. AI is much, much worse. And, even more importantly, not only is it an annoying buzzword, it is already threatening lives (see the mushroom guide written by AI) and democracies (see the "singing Modi" and "New Hampshire Officials to Investigate A.I. Robocalls").
Also, both OpenAI and Anthropic argued that if licenses were required to train LLMs on copyrighted content, today's general-purpose AI tools simply could not exist.
This might be naive, but I agree that AI hype will never be as annoying as Big Data hype.
At least 90% of the time when people mention wanting to use AI for something, I can at least see why they think AI will help them (even if I think it will be challenging in practice).
99% of the time when people talk about big data it is complete bullshit.
AI is arguably the most well-known MapReduce workload that we have today, though.
We need bigger big data (tm) to feed our AI.
So I can use a command line tool to process queries that are processing 100 TB of data? The last time I used Hadoop it was on a cluster with roughly 8PB of data.
Let me know when I can do it locally.
"Can be 235x faster" != "will always be 235x faster", nor indeed "will always be faster" or "will always be possible".
The point is not that there are no valid uses for Hadoop, but that most people who think they have big data do not have big data. Whereas your use case sounds like it (for the time being) genuinely is big data, or at least at a size where it is a reasonable tradeoff and judgement call.
Regarding people's beliefs on this, here's a Forbes article on Big Data [1] (yes, I know Forbes is now a glorified blog for people to pay for exposure). It uses as an example a company with 2.2 million pages of text and diagrams. Unless those are far above average, they fit in RAM on a single server, or on a small RAID array of NVMe drives.
That's not Big Data.
I've indexed more than that as a side-project on a desktop-class machine with spinning rust.
The people who think that is big data are the audience of this, not people with actual big data.
[1] https://www.forbes.com/sites/forbestechcouncil/2023/05/24/th...
for the given question, sure.. you can
there are 60 TB SSDs out there.. you might even fit all of the 8 PB on a given server
Unix has a split(1) tool.
Have you read the article?