Comment by edude03
9 years ago
Arguably, if you can keep everything on one box it will almost always be faster (and cheaper!) than any sort of distributed system. That said, scalability is generally more important than speed, because once a task can be distributed you can add performance by adding hardware. Also, depending on your use case, you can often get fault tolerance "for free".
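As a rough single-box analogue of "add hardware, add throughput", a hedged sketch: parallel grep workers whose throughput scales with core count. The logs/ directory and the ERROR pattern are placeholders, not from this thread:

    # Count ERROR lines per file, running one grep batch per core.
    # Doubling the cores (-P) roughly doubles throughput for CPU-bound work.
    find logs/ -name '*.log' -print0 \
      | xargs -0 -P "$(nproc)" -n 16 grep -c 'ERROR'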
No.
Show that you need scalability first. Chances are you don't.
When you do, scale the smallest possible part of your system that is the bottleneck, not the whole thing.
An established, standardised platform is often more maintainable than a custom solution, even if that platform includes a bit more scalability than you actually need.
But straightforward use of cat, grep, xargs, and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform". If I want to run a simple pipeline of UNIX tools on a random system (like the sketch below), the worst-case prerequisite scenario is reclaiming disk space and installing less common tools (e.g. mawk).
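For concreteness, a sketch of the kind of pipeline meant here; the log path and the line layout are assumptions for illustration:

    # Tally ERROR lines per day, assuming log lines like:
    #   2015-09-25T12:00:01 ERROR something broke
    cat logs/*.log \
      | grep -F ' ERROR ' \
      | awk '{ day[substr($1, 1, 10)]++ } END { for (d in day) print d, day[d] }' \
      | sort

Nothing here is version-sensitive: the awk body runs unchanged under gawk, mawk, or a stock POSIX awk.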
My Hadoop experience is dated (circa 2011): do the worker nodes still poll the scheduler to see whether they have work to do? If so, that's still a giant impediment to speed for smaller tasks, especially if poll times are in the range of minutes.
If Hadoop put effort into making small tasks time-efficient, I think your argument would have merit, provided there's a reasonable chance of actually needing to scale, or of picking up ancillary benefits (fault tolerance, access to other data that needs to be processed with Hadoop, etc.).
There is nothing preventing distributed systems from being faster than one box for this kind of thing. But they don't always bother to pursue efficiency at that level, because things are very different once you have a lot of boxes, and something that used to look important for a couple of boxes doesn't anymore.
Yes, there is: you have a lot of overhead in any case for the same tools.
You don't have the same tools. You are probably thinking of emulating a POSIX filesystem API and the like, then using those command-line tools on top of it in a single-box kind of way. That's not how you treat a distributed system.
EDIT: For something that beats a single box easily, I envision a JIT-equipped interpreter running on each node of a distributed system, in the same process that stores the data, so there is pretty much no overhead to access and process it.
Fancy, highly scalable distributed algorithms have the annoying tendency to start out 10x slower than the most naïve single-machine algorithm.