If you like to do data analysis in bash, you might also enjoy bigbash[1]. This tool generates quite performant bash one-liners from SQL SELECT statements that can easily crunch gigabytes of CSV data.
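To give a flavour of that translation, here is a hand-written sketch (not bigbash's actual output) of what a simple GROUP BY becomes as a pipeline; visits.csv with the country in column 3 is a made-up layout:

# SELECT country, COUNT(*) FROM visits GROUP BY country ORDER BY 2 DESC
cut -d, -f3 visits.csv | sort | uniq -c | sort -rn

Everything streams, so it never needs more memory than sort's working buffers.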
Do you think you can get it to support Manta? I think a lot of people in that ecosystem could benefit from it if you could. I'd help, but I don't really know Java all that well :-(.
It's all about picking the right tool for the job. I think shell scripting is a great prototyping tool and often a good place to start. As the problem gets more complex and bigger, eventually it will warrant a full scale development.
I think people overlook the fact that the author made an even stronger point by using shell scripting, which is relatively inefficient compared to using a compiled language. I guess it would hit the I/O cap without even going parallel.
I am not a Big Data expert, but does that change any of the comments below with reference to large datasets and memory available?
I use J and Jd for fun with great speed on my meager datasets, but others have used it on billion row queries [1]. Along with q/kdb+, it was faster than Spark/Shark last I checked, however, I see Spark has made some advances recently I have not checked into.
J is interpreted and can be run from the console, from a Qt interface/IDE, or in a browser with JHS.
There isn't exactly a direct relationship between the size of the data set and the amount of memory required to process it. It depends on the specific reporting you are doing.
In the case of this article, the output is 4 numbers:
games, white, black, draw
Processing 10 items takes the same amount of memory as processing 10 billion items.
If the data set in this case was 50TB instead of a few GB, it would benefit from running the processing pipeline across many machines to increase the IO performance. You could still process everything on a single machine, it would just take longer.
Some other examples of large data sets+reports that don't require a large amount of memory to process:
* input: petabytes of web log data. output: count by day by country code
* input: petabytes of web crawl data. output: html tags and their frequencies
* input: petabytes of network flow data. output: inbound connection attempts by port
Reports that require no grouping (like this chess example) or group things into buckets with a defined size (ports that are in a range of 1-65535) are easy to process on a single machine with simple data structures.
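A sketch of the fixed-memory shape these reports take; the flow-log layout (gzipped, whitespace-separated, destination port in field 4) is made up:

# inbound connection attempts by destination port; memory is bounded by the
# number of distinct ports (at most 65535 keys), no matter how large the input
zcat flows/*.gz | awk '{count[$4]++} END {for (p in count) print p, count[p]}' | sort -k2,2 -rn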
Now, as soon as you start reporting over more dimensions things become harder to process on a single machine, or at least, harder to process using simple data structures.
* input: petabytes of web crawl data. output: page rank
* input: petabytes of network flow data. output: count of connection attempts by source address and destination port
I kinda forget what point I was trying to make.. I guess.. Big data != Big report.
I generated a report the other day from a few TB of log data, but the report was basically
for day in /data/* ; do   # one directory per day, named YYYY-MM-DD
    echo "$day" "$(zcat "$day"/*.gz | grep -Fc interestingValue)"
done
There are a lot of operational benefits to running on Hadoop/YARN as well. You get node resiliency (host went down? Run the application over there). You also get the Hadoop filesystem, which conveniently stores your data in S3 and distributed HDFS.
These systems were designed by people who probably managed difficult etl pipelines that were nothing but what the author suggests: simplified shell scripts using UNIX pipes.
Besides, going up against Hadoop MR is easy. I'd like to see you compete against something like Facebook's Presto or Spark, which are optimized for network and memory.
Kind of depends on what the 10 GB is. For example, on my project, we started on files that were about 10 GB a day. The old system took 9 hours to enhance the data (add columns from other sources based on simple joins). So we did it with Hadoop on two Solaris boxes (18 virtual cores between them). Same data; 45 minutes. But wait, there's more.
We then created two fraud models that took that 10+ GB file (enhancement added about 10%) and each executed within about 30 minutes. And concurrently. All on Hadoop. All on arguably terrible hardware. Folks at Twitter and Facebook had never thought about using Solaris.
We've continued this pattern. We've switched tooling from Pig to Cascading because Cascading works in the small (your PC without a cluster) and in the large. It's testable with JUnit in a timely manner (looking at you PigUnit). Now we have some 70 fraud models chewing over anywhere from that 10+ GB daily file set to 3 TB. All this in our little 50 node cluster. All within about 14 hours. Total processed data is about 50 TB a day.
As pointed out earlier, Hadoop provides an efficient, scalable, easy distributed application development platform. Cascading makes life very Unix-like (pipes and filters and aggregators). This, coupled with a fully async eventing pipeline for workflows built on RabbitMQ, makes for an infinitely expandable fraud detection system.
Since all processors communicate only through events and HDFS, we add new fraud models without necessarily dropping the entire system. New models may arrive daily, conform to a set structure, and are literally self-installed from a zip file within about 1 minute.
We used the same event + Hadoop architecture to add claim line edits. These are different from fraud models in that fraud models calculate multiple complex attributes and then apply a heuristic to the claim lines. Edits look at a smaller operational scope. But in Cascading this is: pipe from HDFS -> filter for interesting claim lines -> buffer for denials -> pipe to HDFS output location.
Simple, scalable, logical, improvable, testable. I've seen all of these. As the community comes out with new tools, we get more options. My team is working on machine learning and graphs. Mahout and Giraph. Hard to do all of this easily with a home grown data processing system.
As always, research your needs. Don't get caught up in the hype of a new trend. Don't be closed minded either.
I agree that scalable infrastructure is needed to manage a production pipeline, as others have explained well.
I found this article a useful reminder, because sometimes a job doesn't require a fully grown infrastructure. I commonly get requests that don't overlap with existing infrastructure and won't need any follow-up. In that particular case a Hadoop cluster, heck, even loading into a pg db, would be wasted effort.
But I wouldn't want to manage our clickstream analytics pipeline with shell scripts and cron jobs.
Is there any lightweight tooling out there that can schedule/run basic pipeline jobs in a shell environment?
In my experience you can classify people into two herds: those who, when faced with a problem, solve it directly; and those who, faced with the same problem, try to fit it to the tools they want to use. I like to think this is a maturity question, but I can't say I've actually seen someone make the transition from the latter type to the former.
I think that's his point. Companies are chasing "big data" because it's a great buzzword without considering whether it's something they actually need or not.
A well-rounded, hype-resistant developer would look at the same problem and say, "wha? Nah, I'll just write a dozen lines of PowerShell and have the answer for you before you can even figure out AWS' terms of use..."
I don't think the article talks about this specifically but there's also a tendency to say "big data" when all you need is "statistically-significant data". If you're Netflix, if you just want to figure out how many users watch westerns for marketing purposes, you don't need to check literally every user, just a large enough sample so that you can get a 95% confidence or so. But I've seen a lot of companies use their "big data" tools to get answers to questions like that, even though it takes longer than just sampling the data in the first place.
(Now Netflix recommendations, that's a big data problem because each user on the platform needs individualized recommendations. But a lot of problems aren't. And it takes that well-rounded hype-resistant guy to know which are and which aren't.)
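For the westerns question, sampling with stock tools is a one-liner. This is only a sketch, and the tab-separated view log with the genre in field 2 is a made-up format:

# keep roughly 1% of view records at random, then report the sampled share of westerns
zcat views/*.gz | awk -F'\t' 'BEGIN {srand()} rand() < 0.01 {n++; if ($2 == "western") w++} END {print w / n}'

For a proportion, a sample in the tens of thousands already pins the answer down to within about a percentage point at 95% confidence, which is all the marketing question needs.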
I guess the author should have called it out more explicitly for some, but I think that's the point.
I've seen the testimony dozens of times on HN, and I've heard it from a friend who manages Hadoop at a bank, and I've seen it with people building scaled ELK stacks for log analysis: People are too eager to scale out when things can be done locally, given moderate datasets.
> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools. If you have a huge amount of data or really need distributed processing, then tools like Hadoop may be required, but more often than not these days I see Hadoop used where a traditional relational database or other solutions would be far better in terms of performance, cost of implementation, and ongoing maintenance.
I am a bioinformatician. 130tb of raw reads or processed data? Are you trying to build a general purpose platform for all *-seq or focusing on something specific (genotyping)?
I think you might be replying to my comment. We just took delivery of a 20K WGS callset that is 30TB gzip compressed (about 240TB uncompressed) and expect something twice as big by the end of the year. We're trying to build something pretty general for variant level data (post calling, no reads), annotation and phenotype data. Currently we focus on QC, rare and common variant association and tools for studying rare disease. Everything is open source, we develop in the open and we're trying hard to make it easy for others to develop methods on our infrastructure. Feel free to email if you'd like to know more.
Note the magic words were "can be faster", not "are faster".
If you'd read the entire article you'd even have picked up that he's explicitly calling out use of hadoop for data that easily fits in memory, not large data sets.
It's about four hours on an on-prem commodity cluster with ~PB raw storage on 22 nodes. Each node has 12 4TB disks (fat twins) and two Xeons with (I think) 8 cores, and 256GB RAM. It's got a dedicated 10GbE network with its own switches.
The processing is a record addition per item (basically there is a new row for a matrix for every item we are examining, and an old row has disappeared) and a recalculation of aggregate statistics based on the old data and the new data - the aggregate numbers are then off loaded to a front end database for online use. The calculations are not rocket science so this is not compute bound in any way.
I think that we can do it because of data parallelism: the right data is available on each node and every core and every disk, so each pipeline just thumps away at ~50Mbs, and there are about 300 of them, so that's lots of crunching. At the moment I can't see why we can't scale more, although I believe that the max size of the app will be no more than 2x where we are now.
That depends on the tradeoff between management/transfer overhead and actually doing work.
It's always the case in the "word count" style examples, and quite often in real life, that getting the data into the process takes more time than actually processing it.
When you need to distribute, you need to distribute. However, the point where "you need to distribute" sits at about 100x more data than most Hadoop users actually have, and the overhead costs are far from negligible; in fact, they dominate everything until you reach that scale.
This article is a great litmus test for checking if someone has experience working at scale (Multi Terabytes, Multiple analysts, Multiple job types) or not. Anyone who has had that experience will instantly describe why this article is wrong.
It's akin to saying a Tesla is faster than a Boeing 777 on a 100 meter track.
I'd hope people who have worked at scale still are capable of recognizing when the tools they used there are totally overkill. I'd suspect they would, since they'd also be more aware of their limitations (vs somebody without experience, who has to believe the "you need big data and everything is easy" marketing).
That you wouldn't use a Boeing 777 IF your problem is just a 100m track is the entire point of the article. It's explicitly not saying that you should never use the big tools.
They are not overkill at all; rather, they are tuned towards a different set of performance characteristics. E.g. in the Boeing 777 example above, a transatlantic journey.
In the article above, the data and results stay on the local disk; however, in any organization they need to be stored in a distributed manner, available to multiple users with varying levels of technical expertise. Typically that means NFS or HDFS, preferably with the records stored/indexed via Hive/Presto. At that point the real issue is how you reduce the delay from transferring data over the network, which is exactly the original idea (moving computation closer to the data) behind Hadoop/MapReduce.
I feel like every time something like this comes up people completely skip over the benefit of having as much of your data processing jobs in one ecosystem as possible.
Many of our jobs operate on low TBs and growing, but even if the data for a job is super small I'll write it in Hadoop (Spark these days) so that the build, deployment, and scheduling of the job is handled for free by our current system.
Sure, spending more time listing files on S3 at startup than running the job is a waste, but far less of one than the man-hours to build and maintain a custom data transformation.
The main benefit of these tools is not the scale or processing speed though. The main benefits are the fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations provided by the community.
Having worked at a large company and extensively used their Hadoop Cluster, I could not agree more with you.
The author of the blogpost/article completely misses the point. The goal with Hadoop is not minimizing the lower bound on time taken to finish the job but rather maximizing disk read throughput while supporting fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations as you noted. Hadoop has enabled Hive, Presto and Spark.
The author completely forgets that the data needs to be transferred in from some network storage and the results need to be written back! For any non-trivial organization ( > 5 users), you cannot expect all of them to SSH into a single machine. It would be an instant nightmare. This article is essentially saying "I can directly write to a file in a local file system faster than to a database cluster", hence the entire DB ecosystem is hyped!
Finally, Hadoop is not a monolithic piece of software but an ecosystem of tools and storage engines. E.g. consider Presto: software developers at Facebook realized the exact problem outlined in the blogpost, but instead of hacking bash scripts and command line tools they built Presto, which essentially performs similar functions on top of HDFS. Because of the way it works, Presto is actually faster than the "command line" tools suggested in this post.
https://prestodb.io/
> you cannot expect all of them to SSH into a single machine
Why not?
I can do exactly this (KDB warning). Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.
> The author of the blogpost/article, completely … The author completely … [Here's my opinions about what Hadoop really is]
This is a very real data-volume with two realistic solutions, and thinking that every problem is a nail because you're so invested in your hammer is one of the things that causes people to wait for 26 minutes instead of expecting the problem to take 12 seconds.
And it gets worse: terabyte-RAM machines are accessible to the kinds of companies that have petabytes of data and the business case for it, and managing the infrastructure and network bandwidth is a bigger time sink than you think (or are willing to admit).
If I see value in Hadoop, you might think I'm splitting hairs, so let me be clear: I think Hadoop is a cancerous tumour that has led many smart people to do very very stupid things. It's slow, it's expensive, it's difficult to use, and investing in tooling is just throwing good money after bad.
Well, this is actually covered in the accompanying blogpost (link in comments below), and he makes a salient point:
"At the same time, it is worth understanding which of these features are boons, and which are the tail wagging the dog. We go to EC2 because it is too expensive to meet the hardware requirements of these systems locally, and fault-tolerance is only important because we have involved so many machines."
Implicitly: the features you mention are only fixes introduced to solve problems that were caused by the chosen approach in the first place.
> .. cannot expect all of them to SSH into a single machine ..
That's pretty much how the Cray supercomputer worked at my old university. SSH to a single server containing compilers and tooling. Make sure any data you need is on the cluster's SAN. Run a few CLI tools via SSH to schedule a job, and bam - a few moments later your program is crunching data on several tens of thousands of cores.
But, as I pointed out in another comment, what about systems like Manta, which make transitioning from this sort of script to a full-on mapreduce cluster trivial?
Mind, I don't know the performance metrics for Manta vs Hadoop, but it's something to consider...
> For any non-trivial organization ( > 5 users), you cannot expect all of them to SSH into a single machine.
That's exactly the use case we built Userify for (https://userify.com)
This is a very good point. I think many people are so caught up in bashing the hype train around big data and data science that they just casually dismiss these incredibly valid points. It's not necessarily about how big your data is right now, but how big your data will be in the very near future. Retooling then is often a lot more difficult than just tooling properly up front, so even if some Spark operations might seem to add unnecessary overhead right now the lower transition cost down the road is often worth it.
I think the point is if you really have big data then it makes sense, but many shops add huge cost and complexity to projects where simpler tools would be more than adequate.
These are valid points, and I agree many underestimate the cost of retooling and infrastructure. However, I am working on a team of smart engineers, but shell scripting is new to them, much less learning a full Hadoop / spark setup and associated tools. Luckily, you can often have your cake and eat it too: https://apidocs.joyent.com/manta/example-line-count-by-exten... Super useful system so far, and my goal is to allow our team to learn some basic scripting techniques and then run them on our internal cloud using almost identical tooling. Plus things like simple Python scripts are really easy to teach, and with this infrastructure it can scale quickly!
Day 2 problems don't come to mind during days 0 and 1.
> [...] the benefit of having as much of your data processing jobs in one ecosystem as possible. [...] The main benefits are the fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations provided by the community.
Yep! Elasticity is a pretty nice benefit.
Sure, if you're processing a few gigabytes of data, then you could do that with shell scripts on a single machine. However, if you want to build a system that you can "set and forget", that will continue to run over time as data sizes grow arbitrarily, and that -- as you say -- can be fault tolerant, then distributed systems are nice for that purpose. The same job that handles a few gigabytes of data can scale to petabytes if needed.
A job running on a single machine with shell scripts will eventually reach a limit where the data size exceeds what it can handle reasonably. I've seen this happen repeatedly first hand, to the extent that I'd be reluctant to use this approach in production unless I needed something really quick-n-dirty where scaling isn't a concern at all. Another problem with these single-machine solutions is their reliability. If it's for production use, you really want seamless, no-humans-involved failover, which isn't as straightforward to achieve with the single-machine approach unless you deploy somewhat specialized technology (it ends up being something like primary/standby with network attached storage).
Plus, in an environment where you have one job processing GiBs of data, you tend to have more. While any single solo job handling GiBs of data could be done locally, once you have a lot of them, accessed by many different people at a company and under different workflows, the value of distributed data infrastructure starts to make more sense.
Neat article though. Always good to have multiple techniques up your sleeve, to use the right one for the problem at hand.
What happens if a consultant approaches a company that uses Hadoop and offers them "custom data transformation" solutions for their most frequent processing jobs at lower cost and that beat Hadoop's processing times 100-fold?
The company saves money by choosing the lower cost and time by not having to wait for slow Hadoop processing.
It seems like there is competitive advantage to be gained and money to be made by taking advantage of Hadoop's inefficiencies.
But then things are not always what they seem.
This assumes that the company will make a rational choice, which is rarely the case. The choice is usually made based on "no one ever got fired for using X".
Most "big data" problems are really "small data" by the standards of modern hardware:
* Desktop PCs with up to 6TB of RAM and many dozens of cores have been available for over a year.[1]
* Hard drives with 100TB capacity in a 3.5-inch form factor were recently announced.[2]
CORRECTION: THE FIGURE IS 60TB, NOT 100TB. See MagnumOpus's comment below. In haste, I searched Google and mistakenly linked to an April Fools' story. Now I feel like a fool, of course. Still, the point is valid.
* Four Nvidia Titan X GPUs can give you up to 44 Teraflops of 32-bit FP computing power in a single desktop.[3]
Despite this, the number of people who have unnecessarily spent money and/or complicated their lives with tools like Hadoop is pretty large, particularly in "enterprise" environments. A lot of "big data" problems can be handled by a single souped-up machine that fits under your desk.
[1] https://news.ycombinator.com/item?id=12141334
Well basically as soon as big data hit the news everyone stopped doing data and started doing big data.
Same thing with data science. Opened a Jupyter notebook, loaded 100 rows, and displayed a graph - "I am a data scientist".
> Hard drives with 100TB capacity in a 3.5-inch form factor were recently announced
That is an April Fools story.
(Of course you can still get a Synology DS2411+/DX1211 24-bay NAS combo for a few thousand bucks, but it will take up a lot of space under your desk and keep your legs toasty...)
Earlier this year, Seagate were showing off their 60 TB SSD[1] for release next year.
So 100 TB in a single drive isn't too far off.
EDIT: Toshiba is teasing a 100 TB SSD concept, potentially for 2018 [2]
[1] 11th August - http://arstechnica.co.uk/gadgets/2016/08/seagate-unveils-60t...
[2] 10th August - http://www.theregister.co.uk/2016/08/10/toshiba_100tb_qlc_ss...
Putting a 100TB of storage on your network isn't hard though. There are off-the-shelf NAS servers that big - eg http://www.ebuyer.com/671448-qnap-ts-ec2480u-rp-144tb-24-x-6...
Right, "data scientists" with experience call themselves statisticians or analysts or whatever. The "data science" or "big data" industry is comprised of people who just think a million rows of data sounds impressively big because they never experienced data warehouses in the 1990s where a million rows was not even anything special...
The first paying job we ran through our Hadoop cluster in 2011 had 12 billion rows, and they were fairly big rows. This was beyond the limit of what our proprietary MPP database cluster could handle in the processing window it had (to be fair, the poor thing was/is loaded 90%+, which is not a great thing, but a true thing for many enterprises). We couldn't get budget for the scaling bump we hit with the evolution of that machine, but we could pull together a six node Hadoop cluster and lo and behold, for a pittance we got a little co-processor that could. One other motivation was/is that the use case accumulates 600m rows a day, and we were then able to engineer (cheaply) a solution that can hold 6 months of that data vs 20 days. After 6 months our current view is that it's not worth keeping the data, but we are beginning to get cases of regret that we've ditched the longer-window stuff.
There are queries and treatments that process hundreds of billions of substantial database rows on other cheap open source infrastructures, and you can buy proprietary data systems that do it as well (and they are good), but if you want to do it cheaply and flexibly then so far I think that Hadoop wins.
I think that Hadoop won 4 years ago and has been the centre of development ever since (in fact from before that, when MS cancelled Dryad). I think it will continue to be the weapon of choice for at least 6 more years and will be around and important for 20 more after that. My only strategic concern is the filesystem splintering that is going on with HDFS/Kudu.
So you have large data storage, and processing that can handle large data (assuming for convenience that you have a conventional x86 processor with that throughput). The only problem that remains is moving things from the former to the latter, and then back again once you're done calculating.
That's (100 * 1024 GB) / (20 GB/s) = 85 minutes just to move your 100 TB to the processor, assuming your storage can operate at the same speed as DDR4 RAM. A 100 node Hadoop cluster gets the same aggregate throughput out of commodity disks: (100 * 1024 GB) / (100 * 0.2 GB/s) comes to the same 85 minutes.
Back-of-the-envelope stuff, obviously, with caveats everywhere.
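The same back-of-the-envelope in shell, for anyone who wants to swap in their own numbers:

# minutes to stream 100 TB at 20 GB/s (one box at DDR4-ish speed, or
# 100 nodes at 0.2 GB/s of commodity disk each, i.e. the same aggregate)
echo 'scale=1; 100 * 1024 / 20 / 60' | bc    # prints 85.3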
Problem with that kind of setup is that if you unexpectedly need to scale out of that, you haven't done any of the work required to do that, and you're stuck.
How often do you "unexpectedly need to scale out"? By an order of magnitude at least, that is, because below that you could just add a few more of those beefed-up machines.
I wonder what happened to the YAGNI principle. It has arguable uses in some places, but here it seems to fit perfectly.
Yes, there are desktops with high amounts of RAM, but buying a machine like that would probably cost more than setting up a Hadoop cluster on commodity hardware. And for embarrassingly parallel problems, Hadoop can scale semi-seamlessly.
In reality, it still takes work... but can be done.
This idea was the subject of a paper at a major systems conference. The paper is called "Scalability! But at what COST?" - it goes well beyond the simple example above to explore how most major systems papers produce results that can be beaten by a single laptop. Here's the paper and the blog post describing it.
http://www.frankmcsherry.org/assets/COST.pdf
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
I love this. I hope the 'COST' measurement takes off.
I'm not going to go as far as some in condemning the latest frameworks, but I do agree that they are often chosen with no concept of the overhead imposed by being distributed.
Is there anything similar comparing 'old school' distributed frameworks like MPI to the new ones like Spark? I'm curious how much of the overhead is due to being distributed (network latency and Amdahl's law), versus the overhead from the much higher level, and more productive, framework itself.
They use Rust, fantastic! Will put this on my must-read-list. Based on their graphs, it makes one wonder how much literal energy has been wasted using 'scalable' but suboptimal solutions... Of course if you're wishing to start a company competing on data processing (e.g. small IoT startups), being a bit cleverer could let you have the same performance or feature set with 1/10th the overhead costs. So maybe don't let too many people know? ;)
The paper is frankly stupid and a great example of the difference between practice and academia. It looks good because they are using a snapshot of the Twitter network from 2010. In reality the workflow is complex: e.g. the follower graph gets updated every hour, 10 different teams have different requirements for how to set up the graph and computations, and these computations need to be run at different (hourly, weekly, daily) granularity. 100 downstream jobs are also dependent on them and need to start as soon as the previous job finishes. The output of the jobs gets imported/indexed in a database which is then pushed to production systems and/or used by analysts who might update and retry/rerun computations. Unlike a bunch of out of touch researchers, the key concern isn't how "fast" calculations finish, but several others such as reusability, fault tolerance, multi-user support, etc.
I can outrun a Boeing 777 on my bike in a 3 meter race, but no one would care. The single laptop example is essentially that.
> The paper is frankly stupid and a great example of difference between practice and academia. it looks good because they are using a snapshot of Twitter network from 2010.
We used these data and workloads because that was what GraphX used. If you take the graphs any bigger, Spark and GraphX at least couldn't handle it and just failed. They've probably gotten better in the meantime, so take that with a grain of salt.
> Unlike a bunch of out of touch researchers the key concern isn't how "fast" calculations finish, but several others such as ability to reuse, fault tolerance, multi user support etc.
The paper says these exact things. You have to keep reading, and it's hard I know, but for example the last paragraph of section 5 says pretty much exactly this.
And, if you read the paper even more carefully, it is pretty clearly not about whether you should use these systems or not, but how you should not evaluate them (i.e. only on tasks at a scale that a laptop could do better).
How many companies out there playing with big data are at least half of the size of Twitter?
> 1.75GB
> 3.46GB
These will fit in memory on modest hardware. No reason to use Hadoop.
The title could be: "Using tools suitable for a problem can be 235x faster than tools unsuitable for the problem"
This is exactly the point he was making.
People have a desire to use the 'big' tools instead of trying to solve the real problem.
People both underestimate the power of their desktop machine and the 'old' tools and overestimate the size of their task.
Occasionally designers seem to seek credit merely for possessing a new technology, rather than using it to make better designs. Computers and their affiliated apparatus can do powerful things graphically, in part by turning out the hundreds of plots necessary for good data analysis. But at least a few computer graphics only evoke the response "Isn’t it remarkable that the computer can be programmed to draw like that?" instead of "My, what interesting data".
- Edward Tufte
Applies to more than just design.
>People have a desire to use the 'big' tools
Not only that, people seem to love claiming that they're "big data", perhaps because it makes them sound impressive and bigger than they are.
Very few of us will ever do projects that justify using tools like Hadoop, and too few of us are willing to accept that our data fits in SQLite.
Yeah, someone was telling me they need big data for a million rows. I laughed and said SQLite handles that...
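For scale, a million-row CSV is an interactive-speed problem for the stock sqlite3 shell. A sketch (events.csv and its status column are made up):

# load a million-row CSV and run an aggregate; this takes seconds on a laptop
sqlite3 demo.db <<'SQL'
.mode csv
.import events.csv events
SELECT status, COUNT(*) FROM events GROUP BY status ORDER BY 2 DESC LIMIT 10;
SQL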
I love it when clients think they need a server workstation
Specs be damned!
I need to start selling boxes
Last paragraph of the article: "Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools"
That was exactly his point.
Not necessarily true. Depending on your use cases, it often still makes sense to use Hadoop. A really common scenario is that you'll implement your 3.5 GB job on one box, then you'll need to schedule it to run hourly. Then you'll need to schedule 3 or 4 to run hourly. Then your boss will want you to join it with some other dataset, and run that too with the other jobs. You'll eventually implement retries, timeouts, caching, partitioning, replication, resource management, etc.
You'll end up with a half assed, homegrown Hadoop implementation, wishing you had just used Hadoop from the beginning.
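The accretion usually starts with a wrapper like this around the one-box job (a sketch; every path and script name here is hypothetical) and keeps growing:

#!/usr/bin/env bash
# cron: 5 * * * * /opt/jobs/run_hourly.sh
set -euo pipefail
LOCK=/var/lock/hourly-job.lock
LOG=/var/log/hourly-job.log

# poor man's resource management: only one instance at a time
exec 9>"$LOCK"
flock -n 9 || { echo "previous run still going" >>"$LOG"; exit 0; }

# poor man's retries with backoff
for attempt in 1 2 3; do
    if /opt/jobs/process_hourly.sh >>"$LOG" 2>&1; then
        exit 0
    fi
    sleep $((attempt * 60))
done
echo "hourly job failed after 3 attempts" >>"$LOG"
exit 1

Locking, retries, logging, and "don't run two at once" are exactly the things a scheduler like YARN (or even a modest workflow tool) hands you for free.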
1TB will fit in memory too: https://aws.amazon.com/blogs/aws/ec2-instance-update-x1-sap-...
I'd rather pay for 7 c4.large* instances ($.70/hour for all of them) compared to an x1 ($13.38/hour).
*The original article is from 2014 and references c1.medium which are no longer available. c4.large is the closest you can get without dipping down into the t2 class.
...And when you do have 140 TB of chess data, you can move to Manta, and you get to keep your data processing pipeline almost exactly the same. Upwards scalability!
I don't know how the performance would stack against Hadoop, but it'd work.
Manta storage service teaser: https://www.youtube.com/watch?v=d2KQ2SQLQgg
In about 20,000 years the chess DB will get that big. Until then grep should be fine.
Actually, I just posted essentially the same thing before reading your comment. I'm wondering as well how the performance would/will scale. It likely depends on how the data is scattered/replicated, but presumably they've worked out decent schedulers for the system. If not, it is open source! Lovin' it.
Well, all the code running against the data would already have the parallelization advantages of a shell script, as described in this article. On top of that it would probably be running across multiple nodes, so the aggregate IO bandwidth increases the number of records that can be processed simultaneously. The disadvantage is that the data has to be streamed over a network to the reducer node, which could add a good chunk of latency. How much depends on whether you can do some reduction during the map phase (that would help, but it's possible, and indeed likely, that Manta spawns one process and virtualized node per object, which would make that impossible), and on how many virtual nodes are running on the same physical hardware, since network latency is near zero if the reducer and the mapper nodes are on the same physical system (though then you're running into the same boundaries you hit on a laptop, just on a much beefier machine).
But if you're processing terabytes, the network latency is probably barely factoring into your considerations, given how much time you're saving by processing data in parallel in the first place.
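For what it's worth, kicking the article's kind of count across Manta looks roughly like the snippet below; I'm going from memory on the mfind/mjob flags, so treat the exact syntax as approximate. The map phase runs next to each object where it lives, and only the tiny per-object counts cross the network to the reducer:

    # count Result lines across every object under a directory (path is assumed)
    mfind -t o /$MANTA_USER/stor/chess | \
        mjob create -o \
            -m "grep -c 'Result'" \
            -r "awk '{s+=\$1} END {print s}'"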
Reminds me of one of my all time favorite comments about Bane's Rule:
https://news.ycombinator.com/item?id=8902739
And yet a fourth dimension to the problem: time/difficulty multiplier.
Make the best decision given your current circumstances and engineering resources.
Bane's "deep" approach saved tens of millions of dollars in one example, months of time in another. In all cases it was many times easier for the team to keep up, day after day, year after year.
Bane advocates good old-fashioned refactoring. With humble tools like Perl, SQL, and an old desktop, he bested new, fancy, expensive products. Bane deserves the salary of a CEO, or at least a vice president, for the good he has brought to his company.
I think ego leads us to make choices that appeal in the short run but are bad in the long run. Which is more impressive sounding: that you bought a distributed network of a thousand of the newest, shiniest machines, running the latest version of DataLaserSharkAttack; or that you cobbled together some Perl, SQL, and shell one-liners on a four-year-old PC?
Also, good old-fashioned hard work is painful. It is a good kind of pain, like working out your body, rather than a bad kind of pain, like accidentally cutting yourself. But it is just good old-fashioned, humble, hard work to sit down, work through the details, and come up with a better plan.
Before that, it is even more humble, hard work, to learn the things that Bane had learned. Not just anybody could have done what he did. First you have to learn the ins and outs of Perl, SQL, and all the little shell commands, and all their little options. He knows about a lot of different programming problems, like what a "semantic graph" is (I can't say I do), what an "adjacency matrix" is (nope), whether something is an O(N^2) problem or an O(k+n^2) problem (I know I've seen that notation before).
thanks for sharing, this is exactly what i've been wondering about
Arguably, if you can keep everything on one box it will almost always be faster (and cheaper!) than any sort of distributed system. That said, scalability is generally more important than speed because once a task can be distributed you can add performance by adding hardware. As well, depending on your use case you can often get fault tolerance "for free".
No.
Show that you need scalability first. Chances are you don't.
When you do, scale the smallest possible part of your system that is the bottleneck, not the whole thing.
An established, standardised, existing platform is often more maintainable than a custom solution, even if that platform includes a bit more scalability than you actually need.
My Hadoop experience is dated (circa 2011): do the worker nodes still poll the scheduler to see if they have work to do? If so, that's still a giant impediment to speed for smaller tasks, especially if poll times are in the range of minutes.
If Hadoop put effort into making small tasks time-efficient, I think your argument would have merit, provided there's a reasonable chance of actually needing to scale or of picking up ancillary benefits (fault tolerance, access to other data that needs to be processed with Hadoop, etc.).
There is nothing preventing distributed systems from being faster than one box for this kind of thing. But they don't always bother to pursue efficiency on that level, because things are very different once you have a lot of boxes, and something that used to look important for a couple of boxes doesn't anymore.
Yes, there is: you have a lot of overhead in any case for the same tools.
Fancy highly-scalable distributed algorithms have that annoying tendency of starting at 10x slower than the most naïve single-machine algorithm.
This seems to be the Manta [0] way. Letting you run your beloved Unix command pipeline on your Object Store files.
[0] https://www.joyent.com/manta But the youtube videos with Bryan Cantrill are even better at explaining.
If you like to do data analyses in bash, you might also enjoy bigbash[1]. This tool generates quite performant bash one-liners from SQL SELECT statements that easily crunch GBs of CSV data.
Disclaimer: I am the main author.
[1] http://bigbash.it
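To give a flavour of the translation involved (a hand-written sketch of the general idea, not actual bigbash output, and the column layout is assumed), a query like SELECT name, COUNT(*) FROM players GROUP BY name ORDER BY 2 DESC LIMIT 10 maps onto roughly:

    # players.csv, with the player name assumed to be in column 1
    cut -d, -f1 players.csv | sort | uniq -c | sort -rn | head -n 10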
That's pretty cool.
Do you think you can get it to support Manta? I think a lot of people in that ecosystem could benefit from it if you could. I'd help, but I don't really know Java all that well :-(.
It's all about picking the right tool for the job. I think shell scripting is a great prototyping tool and often a good place to start. As the problem gets bigger and more complex, it will eventually warrant full-scale development.
I think people overlook the fact that the author made an even stronger point by using shell scripting, which is relatively inefficient compared to a compiled language. I guess it would hit the I/O cap without even going parallel.
Date on article: Sat 25 January 2014
I am not a Big Data expert, but does that change any of the comments below with reference to large datasets and memory available?
I use J and Jd for fun with great speed on my meager datasets, but others have used it on billion-row queries [1]. Along with q/kdb+, it was faster than Spark/Shark last I checked; however, I see Spark has made some advances recently that I have not checked into.
J is interpreted and can be run from the console, from a Qt interface/IDE, or in a browser with JHS.
[1] http://www.jsoftware.com/jdhelp/overview.html
There isn't exactly a direct relationship between the size of the data set and the amount of memory required to process it. It depends on the specific reporting you are doing.
In the case of this article, the output is just 4 numbers, so processing 10 items takes the same amount of memory as processing 10 billion items.
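To make the constant-memory point concrete: the whole report is a running tally over a handful of keys, so memory stays flat no matter how many games stream through. A minimal sketch, assuming standard PGN files where each game carries a [Result "..."] tag line:

    # only a few distinct Result values exist, so the awk array never grows
    grep -hF '[Result ' *.pgn \
        | awk '{ tally[$0]++ } END { for (r in tally) print tally[r], r }'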
If the data set in this case was 50TB instead of a few GB, it would benefit from running the processing pipeline across many machines to increase the IO performance. You could still process everything on a single machine, it would just take longer.
There are other kinds of large data sets and reports that don't require a large amount of memory to process. Reports that require no grouping (like this chess example), or that group things into buckets with a defined size (say, ports in the range 1-65535), are easy to process on a single machine with simple data structures.
Now, as soon as you start reporting over more dimensions things become harder to process on a single machine, or at least, harder to process using simple data structures.
I kinda forget what point I was trying to make.. I guess.. Big data != Big report.
I generated a report the other day from a few TB of log data, but the report was basically
There are a lot of operational benefits to running on Hadoop/YARN as well. You get node resiliency (host went down? run the application over there), and you get the Hadoop filesystem layer, which conveniently stores your data in S3 and in distributed HDFS.
These systems were designed by people who probably managed difficult ETL pipelines that were nothing but what the author suggests: simplified shell scripts using UNIX pipes.
Besides, going up against Hadoop MR is easy. I'd like to see you compete against something like Facebook's Presto or Spark, which are optimized for network and memory.
What is the point? Who would want to use Hadoop for something below 10GB? Hadoop is not good at doing what it is not designed for? How useful.
Kind of depends on what the 10 GB is. For example, on my project, we started on files that were about 10 GB a day. The old system took 9 hours to enhance the data (add columns from other sources based on simple joins). So we did it with Hadoop on two Solaris boxes (18 virtual cores between them). Same data; 45 minutes. But wait, there's more.
We then created two fraud models that took that 10+ GB file (enhancement added about 10%) and executed within about 30 minutes apiece. But concurrently. All on Hadoop. All on arguably terrible hardware. Folks at Twitter and Facebook had never thought about using Solaris.
We've continued this pattern. We've switched tooling from Pig to Cascading because Cascading works in the small (your PC without a cluster) and in the large. It's testable with JUnit in a timely manner (looking at you PigUnit). Now we have some 70 fraud models chewing over anywhere from that 10+ GB daily file set to 3 TB. All this in our little 50 node cluster. All within about 14 hours. Total processed data is about 50 TB a day.
As pointed out earlier, Hadoop provides an efficient, scalable, easy distributed application development platform. Cascading makes life very Unix-like (pipes and filters and aggregators). This, coupled with a fully async eventing pipeline for workflows built on RabbitMQ, makes for an infinitely expandable fraud detection system.
Since all processors communicate only through events and HDFS, we add new fraud models without necessarily dropping the entire system. New models may arrive daily, conform to a set structure, and are literally self-installed from a zip file within about 1 minute.
We used the same event + Hadoop architecture to add claim line edits. These are different from fraud models in that fraud models calculate multiple complex attributes and then apply a heuristic to the claim lines, while edits look at a smaller operation scope. But in Cascading this is just: pipe from HDFS -> filter for interesting claim lines -> buffer for denials -> pipe to HDFS output location.
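In plain shell terms that shape is just the pipeline below (the paths, the field test, and the apply_denial_edits script are invented stand-ins, purely to show the pipes-and-filters structure):

    # read claim lines from HDFS, filter, group per claim, emit denials, write back
    hadoop fs -cat /claims/incoming/part-* \
        | awk -F'\t' '$5 == "PROFESSIONAL"' \
        | sort -t $'\t' -k1,1 \
        | ./apply_denial_edits \
        | hadoop fs -put - /claims/edits/output.tsv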
Simple, scalable, logical, improvable, testable. I've seen all of these. As the community comes out with new tools, we get more options. My team is working on machine learning and graphs. Mahout and Giraph. Hard to do all of this easily with a home grown data processing system.
As always, research your needs. Don't get caught up in the hype of a new trend. Don't be closed minded either.
i agree that scalable infrastructure is needed to manage a production pipeline, as others have explained well.
i found this article was a useful reminder, because sometimes a job doesn't require a fully grown infrastructure. i commonly get these requests that don't overlap with existing infrastructure and won't need any follow-up. in that particular case a hadoop cluster, heck even loading into a pg db, would be wasted effort.
but i wouldn't want to manage our clickstream analytics pipeline with shell scripts and cron jobs.
is there any lightweight tooling out there that can schedule/run basic pipeline jobs in a shell environment?
Airflow? It might not be what you consider lightweight, though.
Manta? Definitely not lightweight.
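Not tooling exactly, but for basic jobs plain cron plus flock covers "run it hourly and don't start a new run while the last one is still going". A minimal sketch (paths are placeholders):

    # crontab entry: run the hourly pipeline at :15, skip if the previous run is still in flight
    15 * * * *  flock -n /tmp/clickstream_hourly.lock /opt/etl/clickstream_hourly.sh >> /var/log/etl/hourly.log 2>&1

Anything beyond that (dependencies between jobs, backfills, alerting) is usually where people reach for Airflow or Luigi.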
In my experience you can classify people into two herds: those who, when faced with a problem, solve it directly; and those who, faced with the same problem, try to fit it to the tools they want to use. I like to think this is a maturity question, but I don't think I've actually seen someone make the transition from the latter to the former type.
> those who, when faced with a problem, solve it directly
I think you mean "use the tools they already know".
> those who, faced with the same problem, try to fit it to the tools they want to use
I think you mean "use the correct tools for the job".
> I like to think this is a maturity question
It is, but the direction of maturity is from the first case to the second.
This is just plain click bait.
Obviously if a dataset is small enough to possibly fit in memory, it will be much faster to run on a single computer.
I think that's his point. Companies are chasing "big data" because it's a great buzzword without considering whether it's something they actually need or not.
A well-rounded, hype-resistant developer would look at the same problem and say, "wha? Nah, I'll just write a dozen lines of PowerShell and have the answer for you before you can even figure out AWS' terms of use..."
I don't think the article talks about this specifically, but there's also a tendency to say "big data" when all you need is "statistically significant data". If you're Netflix and you just want to figure out how many users watch westerns for marketing purposes, you don't need to check literally every user, just a large enough sample that you can get 95% confidence or so. But I've seen a lot of companies use their "big data" tools to get answers to questions like that, even though it takes longer than just sampling the data in the first place.
(Now Netflix recommendations, that's a big data problem because each user on the platform needs individualized recommendations. But a lot of problems aren't. And it takes that well-rounded hype-resistant guy to know which are and which aren't.)
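To make the sampling point concrete, the shell version is to estimate from a random sample instead of aggregating over every row. A sketch with a made-up file name and column layout:

    # estimate the share of viewing events that are westerns from a 1M-row sample
    shuf -n 1000000 viewing_events.tsv \
        | awk -F'\t' '{ n++ } $4 == "western" { w++ } END { printf "%.1f%%\n", 100*w/n }'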
I guess the author should have called it out more explicitly for some, but I think that's the point.
I've seen the testimony dozens of times on HN, and I've heard it from a friend who manages Hadoop at a bank, and I've seen it with people building scaled ELK stacks for log analysis: People are too eager to scale out when things can be done locally, given moderate datasets.
Though sometimes hadoop makes sense even if local computation is faster. For example you might just be using hadoop for data replication.
The Hadoop article linked is available at https://web.archive.org/web/20140119221101/http://tomhayden3...?
Ok, now do it for >2tb.
Our prod hadoop dataset is now > 130tb, try that!
> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools. If you have a huge amount of data or really need distributed processing, then tools like Hadoop may be required, but more often than not these days I see Hadoop used where a traditional relational database or other solutions would be far better in terms of performance, cost of implementation, and ongoing maintenance.
You can get 128GB DIMMs these days, so 2tb is easy to fit in memory. 130tb, yes that's a different story.
I am a bioinformatician. 130tb of raw reads or processed data? Are you trying to build a general purpose platform for all *-seq or focusing on something specific (genotyping)?
I think you might be replying to my comment. We just took delivery of a 20K WGS callset that is 30TB gzip compressed (about 240TB uncompressed) and expect something twice as big by the end of the year. We're trying to build something pretty general for variant level data (post calling, no reads), annotation and phenotype data. Currently we focus on QC, rare and common variant association and tools for studying rare disease. Everything is open source, we develop in the open and we're trying hard to make it easy for others to develop methods on our infrastructure. Feel free to email if you'd like to know more.
Note the magic words were "can be faster", not "are faster".
If you'd read the entire article you'd even have picked up that he's explicitly calling out use of hadoop for data that easily fits in memory, not large data sets.
What kind of data do you have? Is this mostly text, or more like compressed time series?
We're analyzing genome sequence data on that scale: https://github.com/hail-is/hail
It's long rows of lots of numbers that have to be crunched with each other in quite straightforward ways.
Some back-of-the-envelope calculations show it would take about 3 days using the article's method and a 2 Gbit pipe.
Out of curiosity, how long do you take to process 130tb on hadoop and where/how is the data stored?
It's about four hours on an on-prem commodity cluster with ~PB raw storage on 22 nodes. Each node has 12 4TB disks (fat twins) and two Xeons with (I think) 8 cores, and 256GB RAM. It's got a dedicated 10GbE network with its own switches.
The processing is a record addition per item (basically there is a new row for a matrix for every item we are examining, and an old row has disappeared) and a recalculation of aggregate statistics based on the old data and the new data; the aggregate numbers are then offloaded to a front-end database for online use. The calculations are not rocket science, so this is not compute bound in any way.
I think we can do it because of data parallelism: the right data is available on each node, every core, and every disk, so each pipeline just thumps away at ~50Mbs, and there are about 300 of them, so that's a lot of crunching. At the moment I can't see why we couldn't scale more, although I believe the max size of the app will be no more than 2x where we are now.
But what would happen if those exact same command-line tools were used inside a Hadoop node? What would be the optimum number of processors then?
That depends on the tradeoff between management/transfer overhead and actually doing work.
Always in the "word count" style examples, and quite often in real life, the "get the data into the process" step takes more time than actually processing it.
When you need to distribute, you need to distribute. However, the point where "you need to distribute" sits at about 100x more data than the point at which most Hadoop users actually do it, and the overhead costs are far from negligible; in fact, they dominate everything until you get to that 100x more data.
you would just be adding management-overhead.
More software != more efficient software.
But faster because parallel.
Yes, you do not need a 100 node cluster to crunch 1.75GB of data. I can do that on my phone. What's the author's point?!
That hadoop's reputation of being worth the hassle and complexity that managing it entails is undeserved.
Is there a place to download the database that isn't a self-extracting executable? (Seriously?)
Not just faster to run but also much faster to write
>cat | grep
why
(2014)
Here's the previous comments from the submission a couple years ago: https://news.ycombinator.com/item?id=8908462
Both disk and CPU failures are recoverable on expensive hardware
This article is a great litmus test for checking if someone has experience working at scale (multi-terabytes, multiple analysts, multiple job types) or not. Anyone who has had that experience will instantly describe why this article is wrong. It's akin to saying a Tesla is faster than a Boeing 777 on a 100-meter track.
I'd hope people who have worked at scale still are capable of recognizing when the tools they used there are totally overkill. I'd suspect they would, since they'd also be more aware of their limitations (vs somebody without experience, who has to believe the "you need big data and everything is easy" marketing).
That you wouldn't use a Boeing 777 if your problem is just a 100m track is the entire point of the article. It's explicitly not saying that you should never use the big tools.
They are not overkill at all; rather, they are tuned towards a different set of performance characteristics: in the Boeing 777 example above, the transatlantic journey.
In the article above, the data and results stay on the local disk; however, in any organization they need to be stored in a distributed manner, available to multiple users with varying levels of technical expertise. Typically that means NFS or HDFS, preferably as records stored and indexed via Hive/Presto. At which point the real issue is how you reduce the delay from transferring data over the network, which is exactly the original idea (moving computation closer to the data) behind Hadoop/MapReduce.