Comment by aub3bhat
9 years ago
The paper is frankly stupid and a great example of the difference between practice and academia. It looks good because they are using a snapshot of the Twitter network from 2010. In reality the workflow is complex, e.g. the follower graph gets updated every hour. 10 different teams have their own requirements as to how the graph and computations should be set up. These computations need to be run at different granularities (hourly, daily, weekly). 100 downstream jobs also depend on them and need to start as soon as the previous job finishes. The output of the jobs gets imported/indexed into a database, which is then pushed to production systems and/or used by analysts who might update and retry/rerun computations. Unlike for a bunch of out-of-touch researchers, the key concern isn't how "fast" calculations finish, but several others, such as the ability to reuse, fault tolerance, multi-user support, etc.
I can outrun a Boeing 777 on my bike in a 3-meter race, but no one would care. The single-laptop example is essentially that.
> The paper is frankly stupid and a great example of the difference between practice and academia. It looks good because they are using a snapshot of the Twitter network from 2010.
We used these data and workloads because they were what GraphX used. If you take graphs any bigger, Spark and GraphX, at least, couldn't handle them and just failed. They've probably gotten better in the meantime, so take that with a grain of salt.
> Unlike for a bunch of out-of-touch researchers, the key concern isn't how "fast" calculations finish, but several others, such as the ability to reuse, fault tolerance, multi-user support, etc.
The paper says these exact things. You have to keep reading, and it's hard, I know, but for example the last paragraph of section 5 says pretty much exactly this.
And, if you read the paper even more carefully, it is pretty clearly not about whether you should use these systems, but about how you should not evaluate them (i.e. only on tasks at a scale that a laptop could handle better).
"The paper says these exact things. You have to keep reading, and it's hard I know, but for example the last paragraph of section 5 says pretty much exactly this."
Thanks, that addresses my concern. I take back my comment.
But why stop at a Rust implementation? There are vendors optimizing this sort of thing all the way down to FPGAs. This sort of comparison is hardly meaningful.
The only point of the paper is that the previous publications sold their systems primarily on performance, but their performance arguments had gaping holes.
The C# and Rust implementations have the property that they are easy: you don't need any specific skills to write a for-loop the way we did (the only "tricks" we used were large pages and unbuffered I/O in C#, and mmap in Rust).
The point is absolutely not that these are the final (or any) word in these sorts of computations; if you really care about performance, use FPGAs, ASICs, whatever. There will always be someone else doing it better than you, but we thought it would be nice if that person wasn't a CS 101 undergraduate typing in what would literally be the very first thing they thought of.
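To make concrete what that kind of for-loop looks like, here is a rough sketch, not our exact code: it assumes a flat binary edge list ("edges.bin"), little-endian (u32 src, u32 dst) pairs, a hard-coded upper bound on node ids, and the memmap2 crate rather than whatever we actually used.

```rust
// Minimal sketch: single-threaded out-degree count over an mmap'd edge list.
// Assumes "edges.bin" is a flat array of little-endian (u32 src, u32 dst)
// pairs and that node ids fit under the assumed bound below. This is an
// illustration of the "mmap the file and write the obvious loop" approach,
// not the paper's actual implementation.
use std::fs::File;

use memmap2::Mmap; // crates.io "memmap2", assumed here for the mmap step

fn main() -> std::io::Result<()> {
    let file = File::open("edges.bin")?;
    // Map the file into memory; the OS pages data in as the loop touches it.
    let mmap = unsafe { Mmap::map(&file)? };
    let bytes = &mmap[..];

    // Assumed upper bound on node ids (the 2010 Twitter graph has ~42M nodes).
    let mut degrees = vec![0u32; 1 << 26];

    // Each edge is 8 bytes: 4 for the source id, 4 for the destination id.
    for edge in bytes.chunks_exact(8) {
        let src = u32::from_le_bytes([edge[0], edge[1], edge[2], edge[3]]) as usize;
        degrees[src] += 1;
    }

    let max = degrees.iter().max().unwrap_or(&0);
    println!("max out-degree: {}", max);
    Ok(())
}
```

The same shape works for the graph computations in question: replace the degree counter with a per-node state array and sweep the edge file once per iteration.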
It's a great paper. I really enjoyed it. Keep hitting them with the reality checks they need! :)
How many companies out there playing with big data are at least half of the size of Twitter?
You don't need to be "half the size of Twitter". What does that even mean: in headcount, in TB stored, half of the snapshot they used?
The benefits of using a distributed/Hadoop-style approach to managing your data assets become evident as soon as you have more than 5 employees who access such systems. The exception is when your workload is highly specific, e.g. deep learning, where it makes total sense to use a single machine with as many GPUs as possible.
Let me clarify that I used the exact same snapshot, in 2012 (here is a post that was even cited by a few papers [0]). However, I knew that the reality of using this data was far more complex, and even though you could write "faster" programs on your laptop (I used GraphLab) than on a cluster (I had access to a 50-node Cornell cluster), it didn't mean much.
[0] https://scholar.google.com/citations?view_op=view_citation&h...
Back when I was working in telecommunications (a long time ago), operators had GBs of data coming out of network elements and flowing back into the network management systems.
That data was handled pretty well with Oracle OLAP on HP-UX servers.
I don't work with big data, but I get to see some of the RFPs we receive, and most of them describe scenarios where a 2016 laptop could process the data.