Comment by jcrites
9 years ago
> [...] the benefit of having as much of your data processing jobs in one ecosystem as possible. [...] The main benefits are the fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations provided by the community.
Yep! Elasticity is a pretty nice benefit.
Sure, if you're processing a few gigabytes of data, then you could do that with shell scripts on a single machine. However, if you want to build a system that you can "set and forget", that will continue to run over time as data sizes grow arbitrarily, and that -- as you say -- can be fault tolerant, then distributed systems are nice for that purpose. The same job that handles the few gigabytes of data can scale to petabytes if needed.
A job running on a single machine with shell scripts will eventually reach a limit where the data size exceeds what it can handle reasonably. I've seen this happen repeatedly firsthand, to the extent that I'd be reluctant to use this approach in production unless I needed something really quick-n-dirty where scaling isn't a concern at all. Another problem with these single-machine solutions is their reliability. If it's for production use, you really want seamless, no-humans-involved failover, which isn't as straightforward to achieve with the single-machine approach unless you deploy somewhat specialized technology (it ends up being something like primary/standby with network-attached storage).
Plus, in an environment where you have one job processing GiBs of data, you tend to have more of them. While any single job handling GiBs of data could be done locally, once you have a lot of them, accessed by many different people at a company and under different workflows, distributed data infrastructure starts to make more sense.
Neat article though. Always good to have multiple techniques up your sleeve, to use the right one for the problem at hand.