Comment by lmm

9 years ago

> But straightforward use of cat, grep, xargs and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform".

There are often subtle incompatibilities between the versions of those tools found on different systems (e.g. GNU vs BSD). Worse, there's no test suite or way to declare the dependency, and you may not even notice until the job fails halfway through. With a Spark job built with Maven or similar, by contrast, the dependencies are at least explicit and very easy to reproduce.

"Exotic" is relative. At my current job I would expect a greater proportion of my colleagues to know Spark than gawk. Unix is a platform too - and an underspecified one with poor release/version discipline at that.

In an organization where Unix is standard and widely understood and Spark is not, use Unix; where the reverse applies, it's often more maintainable to use Spark, even if you don't need distribution.

But my original point was: if you need a little more scalability than Unix offers, it may well be worth going all the way to Spark (even though it's far more scalability than you likely need) rather than hacking up some custom job-specific solution that scales just as far as you need. The reason is simply that Spark is standard, documented and all that.


qwertyuiop924 9 years ago

You can look for the GNU utils, you know. And you can do explicit dependency declaration by grepping the version output to see if it's GNU or BSD, and the version number.

It's not convenient, and it could be done better, but it is by no means impossible.
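
A minimal sketch of the kind of check described here, written in Python rather than shell so it can be unit-tested. The function names, the regular expression and the sample banners are illustrative assumptions, not an established convention. It also assumes the tool accepts `--version`, which many BSD utilities do not.

```python
import re
import subprocess


def classify_version_output(text):
    """Return (flavor, version) guessed from a tool's version banner.

    flavor is "gnu", "bsd" or "unknown"; version is a tuple of ints or None.
    Only the first line of the banner is inspected.
    """
    stripped = text.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    lowered = first_line.lower()
    # Check "bsd" first: a BSD banner may also mention GNU compatibility.
    if "bsd" in lowered:
        flavor = "bsd"
    elif "gnu" in lowered:
        flavor = "gnu"
    else:
        flavor = "unknown"
    # First dotted number on the line, e.g. "3.11" or "2.6.0".
    match = re.search(r"\d+(?:\.\d+)+", first_line)
    version = tuple(int(p) for p in match.group(0).split(".")) if match else None
    return flavor, version


def probe(tool):
    """Run `tool --version` and classify whatever it prints."""
    try:
        result = subprocess.run(
            [tool, "--version"], capture_output=True, text=True, timeout=5
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown", None
    return classify_version_output(result.stdout or result.stderr)
```

Calling `probe("grep")` at the top of a job would give a flavor and version to branch on. A tool that rejects `--version` typically comes back as `("unknown", None)`, which is itself a hint that the environment is not the one the script was written for.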

Also, how do your colleagues not know AWK? In a primarily UNIX world, that's something that everybody should know. Besides, you can learn the basics in about 15 minutes.

lmm 9 years ago

> You can look for the GNU utils, you know. And you can do explicit dependency declaration by grepping the version output to see if it's GNU or BSD, and the version number.
>
> It's not convenient, and it could be done better, but it is by no means impossible.

In principle it may be possible, but in practice it's vaporware at best. There's no standard, established way to do this, which means there's nothing that other people could maintain. The closest is probably autoconf, but I never saw anyone ship an aclocal.m4 with their shell script.

Since unix utilities tend to be installed system-wide, it's not really practical to develop or test against old target versions (not that the unix platform is at all test-friendly to start with; Docker may eventually get there, but it's currently immature). If your development machine has version 2.6 of some utility, you'll probably end up accidentally using 2.6-only features without noticing.

> Also, how do your colleagues not know AWK? In a primarily UNIX world, that's something that everybody should know. Besides, you can learn the basics in about 15 minutes.

My organization isn't primarily-unix, and people don't use awk often enough to learn it. I'm sure one could pick up the basics fairly quickly, but that's true of Spark too.
- qwertyuiop924 9 years ago
  
  Okay, I guess.

  For utilities, it's usually safe to assume that whoever is running it is running a comparable version - the most commonly used options are decades old at this point, and it's relatively unlikely you'll use the new stuff.

  As for checking for GNU tooling, a grep against the output of <tool> --version does the trick. This can also get you the version number. You can probably even write a command to do it.

  It's nonstandard and suboptimal, but it is, once again, possible.
  

HelloNurse 9 years ago

You lost me when you put "built with Maven" and "easy to reproduce" in the same paragraph.

Come to think of it, having to "build" (and presumably deploy on multiple application servers) a simple data processing job is a sign of enterprisey overcomplication regardless of the quality of the underlying technology.

lmm 9 years ago

Unless it's a bootable image (unikernel), your "simple" job has dependencies. Better to have them visible, explicit and resolved as part of a release process rather than invisible, implicit, and manually updated by the sysadmin whenever they feel like it.
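
For a job built from Unix tools, a rough approximation of "visible and explicit" is a preflight check that runs before any data is touched. The sketch below is an assumption-laden illustration: the `REQUIRED` table, the tool names and the minimum versions are hypothetical, and it relies on each tool printing a dotted version number in response to `--version`.

```python
import re
import subprocess

# Hypothetical requirements for an imaginary job; tools and minimums are made up.
REQUIRED = {"grep": (2, 5), "gawk": (4, 0)}


def parse_version(banner):
    """Return the first dotted version in a banner as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)+", banner)
    return tuple(int(p) for p in match.group(0).split(".")) if match else None


def read_banner(tool):
    """Return what `tool --version` prints, or "" if it cannot be run."""
    try:
        result = subprocess.run(
            [tool, "--version"], capture_output=True, text=True, timeout=5
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout or result.stderr


def unmet_requirements(required, banner_for=read_banner):
    """List every declared requirement the current machine fails to meet."""
    def fmt(version):
        return ".".join(str(part) for part in version)

    problems = []
    for tool, minimum in sorted(required.items()):
        found = parse_version(banner_for(tool))
        if found is None:
            problems.append(f"{tool}: version not detected")
        elif found < minimum:
            problems.append(f"{tool}: found {fmt(found)}, need at least {fmt(minimum)}")
    return problems
```

A job would call `unmet_requirements(REQUIRED)` first and refuse to start if the list is non-empty, so a mismatch surfaces up front instead of halfway through. This only detects a mismatch; unlike a build tool, it does nothing to resolve or install the right versions.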