
Comment by TickleSteve

9 years ago

No.

Show that you need scalability first. Chances are you don't.

When you do, scale the smallest possible part of your system that is the bottleneck, not the whole thing.

An established, standardised, existing platform is often more maintainable than a custom solution, even if that platform includes a bit more scalability than you actually need.

  • But straightforward use of cat, grep, xargs and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform". If I want to run a simple pipeline of Unix tools on a random system, the worst case is having to reclaim some disk space and install less common tools (e.g. mawk).
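
    A minimal sketch of the kind of pipeline I mean; access.log, the status filter and the field number are made up for illustration:

    ```sh
    # Count successful requests per URL path in a web access log.
    # Assumes the status code appears as " 200 " and the path is field 7;
    # adjust both to the actual log format.
    cat access.log \
      | grep ' 200 ' \
      | gawk '{ counts[$7]++ } END { for (p in counts) print counts[p], p }' \
      | sort -rn \
      | head -20
    ```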

    • > But straightforward use of cat, grep, xargs and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform".

      There are often subtle incompatibilities between the versions of those tools found on different systems (e.g. GNU vs BSD). Worse, there's no test suite or way to declare the dependency, and you may not even notice until the job fails halfway through. Whereas with a Spark job built with Maven or similar, the dependencies are at least explicit and very easy to reproduce.
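
      A concrete sketch of the kind of incompatibility I mean (the commands are illustrative, not from a real job):

      ```sh
      # In-place edit: GNU sed accepts -i with no argument; BSD/macOS sed
      # requires an explicit backup suffix after -i (empty string for none).
      sed -i 's/foo/bar/' file.txt       # GNU sed only
      sed -i '' 's/foo/bar/' file.txt    # BSD sed only

      # Newline-delimited input: GNU xargs has -d, BSD xargs does not.
      printf 'a b\nc d\n' | xargs -d '\n' echo   # GNU xargs only
      ```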

      "Exotic" is relative. At my current job I would expect a greater proportion of my colleagues to know Spark than gawk. Unix is a platform too - and an underspecified one with poor release/version discipline at that.

      In an organization where Unix is standard and widely understood and Spark is not, use Unix; where the reverse applies, it's often more maintainable to use Spark, even if you don't need distribution.

      But my original point was: if you need a little more scalability than Unix offers, it may well be worth going all the way to Spark (even though that's far more scalability than you likely need) rather than hacking up some custom, job-specific solution that scales exactly as far as you need, precisely because Spark is standard, documented and all that.
