Comment by HelloNurse

9 years ago

But straightforward use of cat, grep, xargs and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform". If I want to run a simple pipeline of UNIX tools on a random system, the worst case prerequisite scenario is reclaiming disk space and installing less common tools (e.g. mawk).
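
For concreteness, the sort of pipeline I have in mind is something like this (the log files and their field layout are made up purely for illustration):

    # Hypothetical example: top requesters among successful hits,
    # pulled out of a pile of access logs.
    cat logs/access-*.log \
      | grep ' 200 ' \
      | gawk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' \
      | sort -rn \
      | head -20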

> But straightforward use of cat, grep, xargs and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform".

There are often subtle incompatibilities between the versions of those tools found on different systems (e.g. GNU vs BSD). Worse, there's no test suite or way to declare the dependency, and you may not even notice until the job fails halfway through. Whereas with a Spark job built with Maven or similar, the dependencies are at least explicit and very easy to reproduce.
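
To make the first point concrete, the canonical GNU-vs-BSD trap is sed's in-place flag (not one of the tools named above, but grep and xargs have their own variations on the theme):

    # GNU sed: -i takes an optional backup suffix, attached to the flag itself.
    sed -i 's/foo/bar/' file.txt        # fine with GNU sed, an error under BSD sed
    # BSD sed (e.g. macOS): -i takes a mandatory, separate suffix argument.
    sed -i '' 's/foo/bar/' file.txt     # fine with BSD sed, an error under GNU sed

Either form errors out under the other implementation - exactly the kind of thing you only find out about halfway through a run.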

"Exotic" is relative. At my current job I would expect a greater proportion of my colleagues to know Spark than gawk. Unix is a platform too - and an underspecified one with poor release/version discipline at that.

In an organization where Unix is standard and widely understood and Spark is not, use Unix; where the reverse applies, it's often more maintainable to use Spark, even if you don't need distribution.

But my original point was: if you need a little more scalability than Unix offers, it may well be worth going all the way to Spark (even though that's far more scalability than you likely need) rather than hacking up some custom, job-specific solution that scales just as far as you need - simply because Spark is standard, documented and all that.

  • You can look for the GNU utils, you know. And you can do explicit dependency declaration by grepping the version output to see if it's GNU or BSD, and the version number.

    It's not convenient, and it could be done better, but it is by no means impossible - there's a rough sketch at the end of this comment.

    Also, how do your colleagues not know AWK? In a primarily UNIX world, that's something that everybody should know. Besides, you can learn the basics in about 15 minutes.
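
    To sketch what that version check could look like - the "GNU Awk" banner below is what gawk --version typically prints, but treat the exact wording and the version cut-off as assumptions and adjust them for the tools you actually depend on:

        #!/bin/sh
        # Bail out early unless GNU awk, version 4 or newer, is on the PATH.
        banner=$(awk --version 2>/dev/null | head -n 1)
        case "$banner" in
            "GNU Awk"*) ;;   # looks like gawk, carry on
            *) echo "error: GNU awk (gawk) is required" >&2; exit 1 ;;
        esac
        # The first number in the banner is the major version, e.g. "GNU Awk 4.1.3".
        major=$(printf '%s\n' "$banner" | grep -Eo '[0-9]+' | head -n 1)
        if [ "${major:-0}" -lt 4 ]; then
            echo "error: gawk >= 4 is required, found: $banner" >&2
            exit 1
        fi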

    • > You can look for the GNU utils, you know. And you can do explicit dependency declaration by grepping the version output to see if it's GNU or BSD, and the version number.

      > It's not convenient, and it could be done better, but it is by no means impossible.

      In principle it may be possible, but in practice it's vaporware at best. There's no standard, established way to do this, which means there's nothing that other people could be expected to maintain - the closest is probably autoconf, but I've never seen anyone ship an aclocal.m4 with their shell script. And since Unix utilities tend to be installed system-wide, it's not really practical to develop or test against old target versions (Docker may eventually get there, but it's currently immature) - not that the Unix platform is at all test-friendly to start with. If your development machine has version 2.6 of some utility, you'll probably end up accidentally using 2.6-only features without noticing.

      > Also, how do your colleagues not know AWK? In a primarily UNIX world, that's something that everybody should know. Besides, you can learn the basics in about 15 minutes.

      My organization isn't primarily Unix, and people don't use awk often enough to learn it. I'm sure one could pick up the basics fairly quickly, but that's true of Spark too.


  • You lost me when you put "built with Maven" and "easy to reproduce" in the same paragraph.

    Come to think of it, having to "build" (and presumably deploy on multiple application servers) a simple data processing job is a sign of enterprisey overcomplication, regardless of the quality of the underlying technology.

    • Unless it's a bootable image (unikernel), your "simple" job has dependencies. Better to have them visible, explicit and resolved as part of a release process rather than invisible, implicit, and manually updated by the sysadmin whenever they feel like it.