Comment by elcritch

9 years ago

That's pretty similar to my thinking on the performance, though your point about combining shell script streaming with parallelization is a good way to express it.

The real benefit of this system would show up in a comparison with "traditional" (modern?) big data tools like Spark, where the network latency cost of the reduce phases should be comparable. But since Manta localizes the compute to the data, there should be an overall order of magnitude less network transfer, which should significantly reduce the cost of Manta-based solutions compared to Spark/S3 solutions.
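To make that reasoning concrete, here's a back-of-envelope sketch. It's a toy model, not a measurement: it assumes a Spark/S3-style job pulls the whole input over the network and then shuffles the map output, while a Manta-style job only shuffles the map output because the map phase runs where the data lives. The function name and the `map_output_percent` parameter are made up for illustration, and it ignores replication, result writes, and everything else a real benchmark would capture.

```python
def transfer_ratio(input_bytes: int, map_output_percent: int) -> float:
    """Ratio of network bytes moved: pull-the-data model vs. compute-at-data model.

    input_bytes: total size of the input data set (hypothetical figure).
    map_output_percent: size of the map output as a percentage of the input
        (hypothetical figure; depends entirely on the job).
    """
    # Bytes that cross the network during the reduce/shuffle phase.
    # Assumed identical in both models.
    shuffle_bytes = input_bytes * map_output_percent // 100
    if shuffle_bytes <= 0:
        raise ValueError("map output must be non-empty for the ratio to be defined")

    # Pull model: every input byte crosses the network once, plus the shuffle.
    pull_model_bytes = input_bytes + shuffle_bytes

    # Compute-at-data model: only the shuffle crosses the network.
    local_model_bytes = shuffle_bytes

    return pull_model_bytes / local_model_bytes


if __name__ == "__main__":
    # 1 TB of input where the map phase keeps 10% of it.
    print(transfer_ratio(10**12, 10))
```

Under these assumptions the gap only reaches an order of magnitude when the map phase throws away most of the input; a job whose map output is as large as its input would see only a 2x difference.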

In theory at least, it'd be great to test this on equivalent hardware, or at least equivalently priced hardware. But that would require a nice test data set, which I don't have the resources to set up. Any suggestions on data code that could test the above assumptions would be handy (ahem, HN peeps, got anything?).

_Edits: grammar_

I got nothing on data code. You could try running a comparable S3/EC2 setup against Manta on Joyent, but that would be relatively expensive, and I have no idea what the differences are between Amazon's and Joyent's datacenter layouts, so such a test would not be optimal, although it would test each in its most common use case.

It's also worth mentioning in any performance analysis that Manta is backed by ZFS and Zones, so it has the performance characteristics of those.