Comment by elcritch

9 years ago

Actually, I just posted essentially the same thing before reading your comment. I'm wondering as well how the performance would/will scale. It likely depends on how the data is scattered / replicated, but presumably they've worked out decent schedulers for the system. If not, it is open source! Lovin it.

Well, all the code running against the data would already have the parallelization advantages of a shell script, as described in this article. It would additionally probably be running across multiple nodes, meaning that the combined IO throughput increases the number of records that can be processed simultaneously. The disadvantage is that the data has to be streamed over a network to the reducer node, which could add a good chunk of latency. How much depends on two things. First, how fast the network is: if you can do some reduction during the map, it would help, but it's possible (and indeed likely) that Manta spawns one process and virtualized node per object, which would make this impossible. Second, how many virtual nodes are running on the same physical hardware: the network latency is near zero if the reducer and the mapper nodes are on the same physical system, but then you're running into the same boundaries you hit on a laptop, just on a much beefier system.

But if you're processing terabytes, the network latency is probably barely factoring into your considerations, given how much time you're saving by processing data in parallel in the first place.
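
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function name `estimate_phases`, the 1 TB data set, the 100 nodes, the ~100 MB/s per-node disk throughput, and the ~1 Gbit/s link into a single reducer are made-up figures, not measurements of Manta or any real cluster. It also models only bandwidth, not per-request latency, and it assumes the map phase is allowed to shrink its output at all, which may not hold if Manta really runs one process per object.

```python
def estimate_phases(total_bytes, nodes, disk_bytes_per_s, net_bytes_per_s, map_output_ratio):
    """Return (scan_seconds, shuffle_seconds) for a toy map/reduce job.

    Hypothetical model, not Manta's scheduler:
    - every node scans an equal share of the data in parallel
    - all map output is streamed to a single reducer over one link
    - map_output_ratio is map output size / input size
      (1.0 means the map does no reduction at all)
    """
    scan_seconds = total_bytes / (nodes * disk_bytes_per_s)
    shuffle_seconds = (total_bytes * map_output_ratio) / net_bytes_per_s
    return scan_seconds, shuffle_seconds


if __name__ == "__main__":
    TB = 10**12          # assumed data set size, bytes
    DISK = 100 * 10**6   # assumed per-node disk throughput, bytes/s
    NET = 125 * 10**6    # assumed link into the reducer, bytes/s (about 1 Gbit/s)

    scenarios = [
        ("one machine, nothing sent over the network", 1, 0.0),
        ("100 nodes, map output as large as input", 100, 1.0),
        ("100 nodes, map shrinks data 1000x", 100, 0.001),
    ]
    for label, nodes, ratio in scenarios:
        scan, shuffle = estimate_phases(TB, nodes, DISK, NET, ratio)
        print(f"{label}: scan {scan:.0f}s + shuffle {shuffle:.0f}s")
```

Under these assumed figures the parallel scan is a clear win, and the network only becomes the dominant cost when the map phase passes most of its input straight through to the reducer. Swap in your own numbers before drawing any conclusions from it.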

  • That's pretty similar to my thinking on the performance, though your point about the combination of shell script streaming and parallelization is a good way to express it.

    The real benefit of this system would show when compared to "traditional" (modern?) big data tools like Spark. The network latency cost of the reduce phases should be comparable, but since Manta localizes the compute to the data, there should be an overall order of magnitude less network transfer, which should significantly reduce the cost of Manta-based solutions compared to Spark/S3 solutions.

    In theory at least. It'd be great to test this on equivalent hardware, or at least equivalently priced hardware. But that would require a nice test data set which I don't have the resources to set up. Any suggestions on data code that could test the above assumptions would be handy (ahem HN peeps got anything?).

    _Edits: grammar_

    • I got nothing on data code. You could try running a comparable S3/EC2 setup against Manta on Joyent, but that would be relatively expensive, and I have no idea of the differences between Amazon's and Joyent's datacenter layouts, so such a test would not be optimal, although it would test each in its most common use case.

      It's also worth mentioning in any performance analysis that Manta is backed by ZFS and Zones, so it has the performance characteristics of those.