Comment by qwertyuiop924
9 years ago
...And when you do have 140 TB of chess data, you can move to Manta, and you get to keep your data processing pipeline almost exactly the same. Upwards scalability!
I don't know how the performance would stack up against Hadoop, but it'd work.
Manta storage service teaser: https://www.youtube.com/watch?v=d2KQ2SQLQgg
In about 20,000 years the chess DB will get that big. Until then grep should be fine.
Actually, I just posted essentially the same thing before reading your comment. I'm wondering as well how the performance would/will scale. It likely depends on how the data is scattered / replicated, but presumably they've worked out decent schedulers for the system. If not, it is open source! Lovin it.
Well, all the code running against the data would already have the parallelization advantages of a shell script, as described in this article. It would probably also be running across multiple nodes, meaning that the combined IO speeds increase the number of records that can be processed simultaneously. The disadvantage is that the data has to be streamed over a network to the reducer node, which could add a good chunk of latency. How much depends on two things. First, how fast the network is: if you can do some reduction during the map, it would help, but it's possible (and indeed likely) that Manta spawns one process and virtualized node per object, meaning this is impossible. Second, how many virtual nodes are running on the same physical hardware: the network latency is near zero if the reducer and mapper nodes are on the same physical system, but then you're running into the same boundaries you hit on a laptop, just on a much beefier system.
But if you're processing terabytes, the network latency probably barely factors into your considerations, given how much time you're saving by processing data in parallel in the first place.
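To make the "reduction during the map" idea concrete, here is a minimal sketch in plain Python. It does not use Manta's API; the function names `map_object` and `reduce_counts` are hypothetical, and the `[Result "..."]` tag format assumes the chess data is PGN. Each map step collapses one object into a handful of counts, so only those counts (not the raw records) would need to cross the network to the reducer. This is reduction within a single object only; whether anything could be combined across objects is not assumed here.

```python
from collections import Counter


def map_object(lines):
    """Map phase for ONE object: tally game results locally.

    Assumes PGN-style tag lines such as: [Result "1-0"]
    Returns a small Counter instead of passing every line downstream.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("[Result "):
            # The result is the text between the first pair of double quotes.
            counts[line.split('"')[1]] += 1
    return counts


def reduce_counts(partials):
    """Reduce phase: merge the per-object partial counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    # Two tiny in-memory "objects" standing in for stored files.
    objects = [
        ['[Event "a"]', '[Result "1-0"]', '[Result "0-1"]'],
        ['[Result "1-0"]', '[Result "1/2-1/2"]'],
    ]
    print(reduce_counts(map_object(obj) for obj in objects))
```

In this sketch the reducer receives at most a few key/count pairs per object, which is why any reduction done during the map shrinks the network cost described above.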
That's pretty similar to my thinking on the performance. Though your point about the combination of shell script streaming and parallelization is a good way to express it.
The real benefit of this system would show up in a comparison with "traditional" (modern?) big data tools like Spark. The network latency cost of the reduce phases should be comparable, but since Manta localizes the compute to the data, there should be an order of magnitude less network transfer overall, which should significantly reduce the cost of Manta-based solutions compared to Spark/S3 solutions.
In theory, at least. It'd be great to test this on equivalent hardware, or at least equivalently priced hardware. But that would require a nice test data set, which I don't have the resources to set up. Any suggestions on data or code that could test the above assumptions would be handy (ahem, HN peeps, got anything?).
_Edits: grammar_