Comment by lmm

9 years ago

An established, standardised, existing platform is often more maintainable than a custom solution, even if that platform includes a bit more scalability than you actually need.

But straightforward use of cat, grep, xargs and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform". If I want to run a simple pipeline of UNIX tools on a random system, the worst case prerequisite scenario is reclaiming disk space and installing less common tools (e.g. mawk).

  • > But straightforward use of cat, grep, xargs and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform".

    There are often subtle incompatibilities between the versions of those tools found on different systems (e.g. GNU vs BSD). Worse, there's no test suite or way to declare the dependency, and you may not even notice until the job fails halfway through. Whereas on a Spark job built with maven or similar the dependencies are at least explicit and very easy to reproduce.

    "Exotic" is relative. At my current job I would expect a greater proportion of my colleagues to know Spark than gawk. Unix is a platform too - and an underspecified one with poor release/version discipline at that.

    In an organization where Unix is standard and widely understood and Spark is not, use Unix; where the reverse applies, it's often more maintainable to use Spark, even if you don't need distribution.

    But my original point was: if you need a little more scalability than Unix offers then it may well be worth going all the way to Spark (even though it's far more scalability than you likely need) rather than hacking up some custom job-specific solution that scales just as far as you need, just because Spark is standard, documented and all that.

    • You can look for the GNU utils, you know. And you can do explicit dependancy declaration by grepping the version output to see if it's GNU or bsd, and the version number.

      It's not convenient, and it could be done better, but it is by no means impossible.

      Also, how do your colleagues not know AWK? In a primarily UNIX world, that's something that everybody should know. Besides, you can learn the basics in about 15 minutes.

      4 replies →

    • You lost me when you put "built with Maven" and "easy to reproduce" in the same paragraph.

      Thinking of it, having to "build" (and presumably deploy on multiple application servers) a simple data processing job is a sign of enterprisey overcomplication regardless of the quality of the underlying technology.

      1 reply →