Comment by deadgrey19
9 years ago
This idea was the subject of a paper at a major systems conference. The paper is called "Scalability! But at what COST?" and it goes well beyond the simple example above to explore how most major systems papers produce results that can be beaten by a single laptop. Here are the paper and the blog post describing it:
http://www.frankmcsherry.org/assets/COST.pdf
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
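To give a sense of what the single-threaded baselines in the paper look like, here is a rough Rust sketch of a per-edge PageRank loop in the same spirit (my own illustration, not the authors' code; graph loading, dangling nodes and convergence checks are all simplified):

    // Illustrative single-threaded PageRank over an in-memory edge list.
    // A simplified sketch in the spirit of the paper's laptop baselines,
    // not the authors' actual implementation.
    fn pagerank(nodes: usize, edges: &[(usize, usize)], iterations: usize) -> Vec<f32> {
        // count outgoing edges per node
        let mut out_degree = vec![0u32; nodes];
        for &(src, _) in edges {
            out_degree[src] += 1;
        }
        let mut ranks = vec![1.0 / nodes as f32; nodes];
        for _ in 0..iterations {
            // start each node at the teleport term, then push rank along edges
            let mut next = vec![0.15 / nodes as f32; nodes];
            for &(src, dst) in edges {
                next[dst] += 0.85 * ranks[src] / out_degree[src] as f32;
            }
            ranks = next;
        }
        ranks
    }

    fn main() {
        // Tiny made-up graph: 0 -> 1, 1 -> 2, 2 -> 0
        let edges = [(0, 1), (1, 2), (2, 0)];
        println!("{:?}", pagerank(3, &edges, 20));
    }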
I love this. I hope the 'COST' measurement takes off.
I'm not going to go as far as some in condemning the latest frameworks, but I do agree that they are often chosen with no concept of the overhead imposed by being distributed.
Is there anything similar comparing 'old school' distributed frameworks like MPI to the new ones like Spark? I'm curious how much of the overhead is due to being distributed (network latency and Amdahl's law), versus the overhead from the much higher-level, and more productive, framework itself.
They use Rust, fantastic! I will put this on my must-read list. Based on their graphs, it makes one wonder how much literal energy has been wasted using 'scalable' but suboptimal solutions... Of course, if you're looking to start a company competing on data processing (e.g. a small IoT startup), being a bit cleverer could let you deliver the same performance or feature set with 1/10th the overhead costs. So maybe don't let too many people know? ;)
The paper is frankly stupid and a great example of the difference between practice and academia. It looks good because they are using a snapshot of the Twitter network from 2010. In reality the workflow is complex: the follower graph gets updated every hour, 10 different teams have different requirements for how to set up the graph and computations, and these computations need to run at different granularities (hourly, weekly, daily). 100 downstream jobs also depend on them and need to start as soon as the previous job finishes. The output of the jobs gets imported/indexed into a database, which is then pushed to production systems and/or used by analysts who might update and retry/rerun computations. Unlike for a bunch of out-of-touch researchers, the key concern isn't how "fast" calculations finish, but several others such as reusability, fault tolerance, multi-user support, etc.
I can outrun a Boeing 777 on my bike in a 3-meter race, but no one would care. The single-laptop example is essentially that.
> The paper is frankly stupid and a great example of the difference between practice and academia. It looks good because they are using a snapshot of the Twitter network from 2010.
We used these data and workloads because that was what GraphX used. If you take the graphs any bigger, Spark and GraphX at least couldn't handle it and just failed. They've probably gotten better in the meantime, so take that with a grain of salt.
> Unlike for a bunch of out-of-touch researchers, the key concern isn't how "fast" calculations finish, but several others such as reusability, fault tolerance, multi-user support, etc.
The paper says these exact things. You have to keep reading, and it's hard I know, but for example the last paragraph of section 5 says pretty much exactly this.
And, if you read the paper even more carefully, it is pretty clearly not about whether you should use these systems or not, but how you should not evaluate them (i.e. only on tasks at a scale that a laptop could do better).
"The paper says these exact things. You have to keep reading, and it's hard I know, but for example the last paragraph of section 5 says pretty much exactly this."
Thanks, that addresses my concern. I take back my comment.
But why stop at a Rust implementation? There are vendors optimizing this down to FPGAs. This sort of comparison is hardly meaningful.
It's a great paper. I really enjoyed it. Keep hitting them with the reality checks they need! :)
How many companies out there playing with big data are at least half of the size of Twitter?
You don't need to be "half the size of Twitter". What does that even mean: in headcount, in TB stored, in half of the snapshot they used?
The benefits of using a distributed/Hadoop-style approach to managing your data assets become evident as soon as you have more than 5 employees who access such systems. Only if your workload is highly specific, e.g. in Deep Learning, does it make total sense to use a single machine with as many GPUs as possible.
Let me clarify that I used the exact same snapshot, in 2012 (here is a post that was even cited by a few papers [0]). However, I knew that the reality of using this data was far more complex, and even though you can write "faster" programs on your laptop (I used GraphLab) than on a cluster (I had access to a 50-node Cornell cluster), it didn't mean much.
[0] https://scholar.google.com/citations?view_op=view_citation&h...