Comment by zzzcpan

9 years ago

There is nothing preventing distributed systems from being faster than one box for this kind of thing. But they don't always bother to pursue efficiency at that level, because things are very different once you have a lot of boxes, and something that used to look important for a couple of boxes no longer does.

Yes, there is: you have a lot of overhead in any case for the same tools.

  • You don't have the same tools. You are probably thinking about emulating POSIX filesystem API and things like that and using those command-line tools on top of that in a single-box kind of way. That's not how you treat your distributed system.

    EDIT: For something that beats a single box easily, I envision an interpreter with a JIT running on each node in a distributed system, in the same process that stores the data, with pretty much no overhead to access and process it.

    • >You are probably thinking about emulating POSIX filesystem API and things like that and using those command-line tools on top of that in a single-box kind of way. That's not how you treat your distributed system.

      Yeah, but Manta's mapreduce does something close, and it seems to work okay.
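To make the "run the code in the same process that stores the data" idea from the EDIT above a bit more concrete, here is a rough single-machine sketch. It is only an illustration, not anyone's actual design: `Node`, `run_local`, and `query` are hypothetical names, the "nodes" are plain in-memory objects, and an ordinary Python function stands in for the JIT-compiled code. The point it tries to show is that the function travels to the data and only small partial results travel back.

```python
from collections import Counter


class Node:
    """Hypothetical storage node that keeps its shard of records in memory."""

    def __init__(self, records):
        self.records = list(records)

    def run_local(self, func):
        # func runs in the same process that holds the records, so the raw
        # data is never copied out through a filesystem or network API.
        return func(self.records)


def count_words(records):
    """Per-node work: count words in the local shard only."""
    counts = Counter()
    for line in records:
        counts.update(line.split())
    return counts


def query(nodes, func):
    """Ship func to every node and merge the (small) partial results."""
    total = Counter()
    for node in nodes:
        total += node.run_local(func)
    return total


# Two simulated nodes, each holding a different shard.
nodes = [Node(["a b a"]), Node(["b c"])]
result = query(nodes, count_words)
```

In a real system the call to `run_local` would cross a network boundary, and that is where the per-node overhead argued about in this thread would show up.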

Fancy highly-scalable distributed algorithms have that annoying tendency of starting at 10x slower than the most naïve single-machine algorithm.