Comment by aub3bhat

9 years ago

You don't need to be "half the size of Twitter". What does that even mean, in headcount, in TB stored, half of the snapshot they used?

The benefits of using a distributed/hadoop style approach to managing your data assets becomes evident as soon as you have more than 5 employees who access such systems. Unless your workload is highly specific, e.g. in Deep Learning, it makes total sense to use a single machine with as many GPUs as possible.

Let me clarify that I used the exact snapshot, in 2012 (here is post that was even cited by few papers [0]) , However I knew that reality of using this data was far complex, and even though you can write "faster" programs on your laptop (I used GraphLab) than a cluster (I had access to 50 nodes Cornell cluster), it didn't mean much.

[0] https://scholar.google.com/citations?view_op=view_citation&h...

Back when I was working for telecommunications (long time ago), operators had GB of data coming out of network elements all back into the network management systems.

That data was handled pretty well with Oracle OLAP in HP-UX servers.

I don't work with big data, but get to see some of the RFPs we get, and most of them are in the scenario of 2016 laptop being able to process the data.