The 500-mile email (2002)

11 years ago (web.mit.edu)

This one comes up every 3-4 years or so in sysadmin communities, and I read it every single time, because it's worth it.

It's one of those things that I highly doubt would have occurred to me to have even checked, or given even a moment's thought to, under normal circumstances.

Another email incident at Microsoft worth reading [1].

[1] http://blogs.technet.com/b/exchange/archive/2004/04/08/10962...

  • One listserve (can't remember which) made up a list for people who complained like this instead of following the unsubscribe instructions. The admins would remove complainers from the normal lists and add them all to one mailing list, where the only emails they got were each other's demands to be taken off the mailing list, with unsubscribe instructions added to the beginning and the end of every single email.

  • Ha. There is no explanation of why the mailing lists were named "Bedlam" though, and I doubt non-native readers know what it refers to. To quote Wikipedia [0]:

    "Bedlam may refer to:

    Bethlem Royal Hospital, London hospital first to specialise in the mentally ill and origin of the word "bedlam" describing chaos or madness"

    [0] http://en.wikipedia.org/wiki/Bedlam

    • I'm a non-native speaker and I know what Bedlam means. Thanks to Ultima Online and Diablo :)

  • I also found that to be evidence of pretty horrific architecture in Exchange. Two actual recipient lists with a secret internal one? Bloating headers to 13K? At the very least, it seems to me like they chose to put the distribution logic at the wrong layer...

    • > Two actual recipient lists with a secret internal one?

      How else do you propose handling BCC and mailing lists?

  • Ah yes, the age old "reply-all" email storm.

    The bit about the recipient processing bug is novel though, ouch.

If only every bug report that I received had been processed by a geostatistician... Usually I get a "hey, I can't get X to work". One of three responses from me usually fixes it: "Is your computer on?", "are you online?", and "try hitting refresh".

I am actually surprised the sysadmin in this scenario thought it was a bad thing that the statistics department did their research and presented a well-documented error.

  • Well, technically, the geostatistician (Did I spell that right?) was doing research that was orthogonal to the actual problem and its symptoms. In this case, the results were sufficiently odd that they sort of pointed in the right direction, but I've been sent off on wild goose chases by people skillfully applying their own particular set of skills before.

    On the other hand, there's the Word document with nothing but a screenshot showing half of a useless error message.

  • Reminds me... when I post a support request to Google Apps, the issue description header says "in as much detail as possible"... but the field is limited to 1000 characters. When you're dealing with anything other than simple first-level support issues, a user simply can't put in a usefully descriptive amount of detail...

I didn't know about the units program. Is there any resource out there that lists these little *nix utility programs?

Shouldn't this account for a round trip, and the speed through copper (~2/3 of the speed of light)? That would lower the radius to much less than 500 miles.

  • I had this thought when reading this before as well. I imagine that the "3 milliseconds" they determined from testing was a typical number, maybe the median/mean, and that the actual timeout varied considerably depending on CPU load at that particular moment. Add in a number of retries for the server to attempt sending each email, and the effective timeout might have been a few milliseconds more... or at least it must have been, because `(2 * 500 miles) / (2/3 speed of light)` works out to about 8 milliseconds (where the 2X is for the round trip, and 2/3 is a rough multiplier for the speed of light traveling in either copper or optical fiber).
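    For anyone who wants to check that arithmetic, here is a rough Python sketch. The 2/3 velocity factor and the idea of a strict 3 ms budget are assumptions taken from the comments above, not measurements, and the function names are made up for illustration.

```python
# Rough propagation-delay arithmetic; every constant here is an approximation.
C_VACUUM_KM_PER_S = 299_792.458   # speed of light in a vacuum
KM_PER_MILE = 1.609344
VELOCITY_FACTOR = 2 / 3           # assumed rough figure for copper or fiber


def round_trip_ms(one_way_miles, velocity_factor=VELOCITY_FACTOR):
    """Milliseconds for a signal to travel out and back over one_way_miles."""
    speed_km_per_s = C_VACUUM_KM_PER_S * velocity_factor
    return 2 * one_way_miles * KM_PER_MILE / speed_km_per_s * 1000


def reachable_radius_miles(timeout_ms, velocity_factor=VELOCITY_FACTOR):
    """One-way distance whose round trip just fits inside timeout_ms."""
    speed_km_per_s = C_VACUUM_KM_PER_S * velocity_factor
    return (timeout_ms / 1000) * speed_km_per_s / 2 / KM_PER_MILE


print(f"{round_trip_ms(500):.2f} ms")            # round trip for 500 miles
print(f"{reachable_radius_miles(3):.0f} miles")  # radius for a strict 3 ms budget
```

    Under these assumptions the 500-mile round trip comes out just over 8 ms, and a strict 3 ms budget would cover somewhere under 200 miles, which is why the effective timeout in the story was probably looser than 3 ms.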

Another of the 10,000 here - this is such a delightful story.

Also just discovered the "units" conversion program and disappointed that the default Mac library has only 586 units. And shockingly there don't seem to be compatible libraries out there.

I was so happy to discover that units command line program, then I realized that Google already does this, it just wasn't as fun.

I've seen a few comments about units not having lightseconds, so here are a few ways to add the missing unit if you don't have it.

1) Add this line under the lightyear definition in /usr/share/misc/units.lib (or wherever `man units` says the standard units library is under the FILES section)

    lightsecond lightyear / 365.25 / 24 / 60 / 60

2) If you're on a Mac and use Homebrew, just `brew install gnu-units` and then run `gunits`
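If you want to convince yourself that the definition in 1) is sensible before editing the library, here is a small Python check. It assumes the library's lightyear is based on a 365.25-day Julian year (I have not verified that for every units.lib), and the helper name is made up for illustration.

```python
# Sanity check of: lightsecond = lightyear / 365.25 / 24 / 60 / 60
C_M_PER_S = 299_792_458                           # speed of light in a vacuum
METERS_PER_MILE = 1609.344
SECONDS_PER_JULIAN_YEAR = 365.25 * 24 * 60 * 60   # assumed basis of "lightyear"

lightyear_m = C_M_PER_S * SECONDS_PER_JULIAN_YEAR
lightsecond_m = lightyear_m / 365.25 / 24 / 60 / 60


def millilightseconds_to_miles(mls):
    """Convert millilightseconds to miles using the definition above."""
    return mls * 1e-3 * lightsecond_m / METERS_PER_MILE


print(f"{lightsecond_m:.0f} m per lightsecond")
print(f"{millilightseconds_to_miles(3):.2f} miles in 3 millilightseconds")
```

With that assumption, one lightsecond works out to the distance light covers in one second, and 3 millilightseconds lands a little over 558 miles.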

  • That's the speed of light in a vacuum ... through fiber-optic cable the speed of light is about two-thirds that value.

Damn statisticians. They do know their job quite well.

  • It was a seriously accurate bug report. If only all users were so thoughtful.

    • > If only all users were so thoughtful.

      But then it sent him off in a direction not worth going. He literally started to map out how far emails would go if they succeeded. The whole time the error was in the timeout instead.

Who thought it was going to be a TTL issue before finishing reading the story? :)

  • You probably mean something else (RTT?) but definitely not TTL, which is a completely different thing :)

    • TTL is involved when dealing with routed networks. The farther away the destination, the more hops you normally get on the way. If the starting TTL is low, you won't reach the destination. So TTL values can cause problems like this, although the radius wouldn't be so precise. Damn statisticians!
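      A toy Python model of that, in case it helps. It is a deliberately simplified sketch: each router decrements the TTL and drops the packet if the result is zero, and the function names and hop counts are made up for illustration.

```python
def hops_survived(initial_ttl, path_hops):
    """Walk a packet past path_hops routers, decrementing the TTL at each one.

    Returns how many routers forwarded the packet. Simplified model: a router
    that would reduce the TTL to zero drops the packet instead of forwarding it.
    """
    ttl = initial_ttl
    forwarded = 0
    for _ in range(path_hops):
        ttl -= 1
        if ttl <= 0:
            break  # packet dropped at this router
        forwarded += 1
    return forwarded


def reaches_destination(initial_ttl, path_hops):
    """True if the packet gets past every router on the path."""
    return hops_survived(initial_ttl, path_hops) == path_hops


print(reaches_destination(8, 5))    # short path, fits within the TTL
print(reaches_destination(8, 12))   # long path, dropped on the way
```

      The cutoff here is a hop count, not a distance, so a low TTL would give a ragged boundary that follows the routing topology rather than a tidy 500-mile circle.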

  • Not I, I don't know why but I was thinking IP version issue but that makes no sense.

Why do I get:

> unknown unit 'millilightseconds'

Is this one of the embellishments that just makes the story more entertaining?

  • Not an embellishment at all.

    Via 'man units': "The conversion information is read from a units data file that is called 'definitions.units' and is usually located in the '/usr/share/units' directory."

    Via definitions.units (L. 223), you can see the milli- prefix: https://gist.github.com/anonymous/f06769de95e0c7f9e658#file-...

    Via definitions.units (L. 1060), you can see the lightsecond unit: https://gist.github.com/anonymous/f06769de95e0c7f9e658#file-...

    Maybe check it for completeness?

    Edit: Spelling

    • Some distributions only support lightyear so adding this line to your units file (which you can find with man units) will give you support for *lightseconds:

      lightsecond lightyear / 365.25 / 24 / 60 / 60

  • I had the same thing happen to me. From the manpage I gathered that units uses the definitions in /usr/share/misc/units.lib. By running `cat /usr/share/misc/units.lib | grep light` I found I only had lightyear and its shortcut ly defined. I added lightsecond, and since the milli prefix is already defined, it worked a treat.

    Here's the line you'll want to add:

    lightsecond lightyear / 365.25 / 24 / 60 / 60

  • If you're on a Mac, try `brew install gnu-units`; the stock units is probably using a very incomplete library of units.

  • More complete units library. Note how the original author's units has 1311 units and 63 prefixes, OSX only has 586 and 56.

Absolutely a good read. Stories like this can sometimes help with a completely different problem. You're dealing with some other problem, you remember this story, and you figure out what's wrong because of the similarities. I remember fixing a problem with PostgreSQL by recalling a story about Unicode and Postfix: different domain, but similar problem.

That was great out-of-the-box thinking, and I wonder if that could be used as one of these job interview questions:

Q: "Your email server for some reason is only working for addresses within 500 miles of the server. What may go wrong?"

And let the candidate think logically and reach some sane answer, even if not 100% accurate (i.e. check routers first, connectivity, DNS, timeouts...)

If you're a sysadmin and someone brings in a consultant who gets root access and upgrades the whole OS to a new operating system which then almost takes out email... wouldn't that be a problem?

If I were the sysadmin and that happened, I would need to have a meeting with some people. What's the point of being a sysadmin if the operating system is randomly going to be completely changed without someone telling you?

I have a fair amount of built up rage. This seems like one of those situations where it is actually your responsibility to rip people a new one.

A perfect answer to the YC application question - "Tell us something surprising or amusing that one of you has discovered" :)

Every time I read this I am reminded of units(1) util, which is super useful and I always forget about and revert to Google. But yeah, that connect timeout to 500 mi correlation is fun too.

1 year ago: https://news.ycombinator.com/item?id=123489

  • Once a year is about the right frequency. Recurring stories is one way in which a community shares and perpetuates its culture with newcomers. Some of them are a delight to read on that yearly cadence, like the SR-71 story about a pilot and his copilot becoming a crew.

    That said, it's wise to consider the frequency with which such things appear, individually and in total. Too much repetition and focus on memes becomes dysfunctionally self-obsessive. Not sure what the right answer is, but I can probably deal with once per year, short time on front page, and small % of total content.

    • This is an interesting idea. Have a system where a community can mark something as important, and to have it automatically reposted at preset intervals. Community members could be allowed to additionally repost, or the system can politely say it's already archived and will be shared again on such & such date. Use it as a way to reinforce community history.

  • Maybe the submitter is one of the ten thousand https://xkcd.com/1053/

    Bet there are a few more that will find this submission too.

    • This doesn't apply very well... HN is heavily archived... this comic is about being rude to people for not knowing about something, not justifying shoving the same cyclical content in people's faces repeatedly.

  • Wow, I must have bad timing. I've had an account here for almost all of those, and I think I was probably lurking for the 1 or 2 occurrences when I did not have an account, but don't remember seeing it before.

    Or perhaps senility is setting in early. :-D

  • The only significant discussion was almost five years ago. Or about the time the first iPads went on sale. And before either of us were members.

    I missed it all the other times and am glad it was reposted.

  • I don't think this is a bad thing. It was either 1 or 2 years ago when I first read about this - newcomers to the community have to find out about things in one way or another.

Can we get a nice "HN Classic" tag to put beside annual stories like this? I'm fine if stories like this pop up every year, actually.

> And also being a good system administrator, I had written a sendmail.cf [...]

Say what? Nobody writes a sendmail.cf from scratch, unless they are crazy.

> ... that used the nice long self-documenting option and variable names available in Sendmail 8 rather than the cryptic punctuation-mark codes that had been used in Sendmail 5

Good system administrators stick to conservative, portable subsets of configuration and scripting languages, rather than bleeding edge stuff.

When they deviate, they have a clear plan. They document their choice to use something new and shiny, and they keep it separated from the default system configuration.

Since SunOS came with Sendmail 5, the upgraded Sendmail 8 should have been installed in some custom location with its own path so that it coexists with the stock Sendmail, and is not perturbed if the OS happens to upgrade that.

A good sysadmin would stick that in some /usr/local/bin type local directory, and not overwrite /usr/bin/sendmail.

The consultant was not wrong to update the OS. People have reasons to do that. The consultant should have consulted with the sysadmin, of course. But even in that event, it might not have immediately occurred to the sysadmin what the implication would be to the sendmail setup.

  • Goodness, you're determined to find fault, aren't you? (For the record in re your comment later about my "basis to call [myself] a good system admin", those claims were a) jokey, and b) fairly well-substantiated by my reputation by that time, I should think. I was published by that point and had been on several conference committees along with many who'd be reading that mailing list; I hardly needed to peacock like you seem to think I was doing.)

    But I think your criticisms seem a little uninformed (or possibly over-informed by later practice to the point where you aren't considering this in the context of mid-1990's practice). Let's see...

    > > And also being a good system administrator, I had written a sendmail.cf [...]

    > Say what? Nobody writes a sendmail.cf from scratch, unless they are crazy.

    I didn't say "from scratch". I used the m4 macros to create a cf, like everyone did at the time. Using the default file would only work if you still used email programs that read raw mbox files, had no email lists, and needed no interesting aliasing or vacation script behavior. Oh, and ran in an environment where it was reasonable to assume someone's canonical email address could be found via the equivalent of `echo "${USER}@${HOST#.}"`.

    Very few production systems could get away with that; writing a sendmail.cf was standard practice. And with m4, you usually spoke of "writing" a file where today we'd call it "configuring" a file; either way it was taking boilerplate and replacing bits with things that were right for your situation. I assume you wouldn't have had an issue with my writing that I'd "configured" the sendmail.cf. That's all I did.

    > > ... that used the nice long self-documenting option and variable names available in Sendmail 8 rather than the cryptic punctuation-mark codes that had been used in Sendmail 5

    > Good system administrators stick to conservative, portable subsets of configuration and scripting languages, rather than bleeding edge stuff.

    Hmm, you either weren't administering SunOS in the mid-90's or you're forgetting some details. SunOS still came with Sendmail 5 years after best practice was to use Sendmail 8. Check out the O'Reilly Sendmail book of the time's pagecount: it was longer than the prior and the later versions because it had to document both. I'm not entirely certain SunOS (as opposed to Solaris) ever was upgraded to Sendmail 8 in the distribution; obviously the people using SunOS still so late were change-averse.

    "Bleeding edge" != "the version that all but the most conservative holdouts are using". Also, remember that this was the same period we were doing the rsh/rlogin conversion to SSH. Sendmail 5 still had known security issues that were fixed in Sendmail 8. We were used to replacing system components when what the OS vendor was shipping us was literally dangerous to run.

    And Sendmail 8's Sendmail 5 compatibility mode was simply there for testing; it was never intended to be used in production long-term, so using a least-common-denominator sendmail.cf wouldn't have been "conservative and portable"; it would have been risky, bordering on malpractice.

    > Since SunOS came with Sendmail 5, the upgraded Sendmail 8 should have been installed in some custom location with its own path so that it coexists with the stock Sendmail, and is not perturbed if the OS happens to upgrade that.

    > A good sysadmin would stick that in some /usr/local/bin type local directory, and not overwrite /usr/bin/sendmail.

    Again, either you didn't run this installation in the mid-90's or you're forgetting some details. /usr/lib/sendmail (notice the "lib"! Your referring to "/usr/bin/sendmail" suggests to me you definitely weren't running SunOS 4 or have forgotten details; sendmail was never in /usr/bin) couldn't be left alone, as other tools hardcoded that path. The actual executable was there, so symlinking couldn't be used to get around that.

  • > Say what? Nobody writes a sendmail.cf from scratch, unless they are crazy.

    The point, moreover, was that he had a custom version of the config file (not just the default).

    • Yes, sites have necessary customizations in sendmail.cf. These do not have to be rewrites that use shiny new syntax.

      My biggest problem with the author was not that he uses his admin blunders as a basis to call himself a good sysadmin, but that he assumed that the stats people were idiots who don't know anything about `puters or networks.

      I was not surprised by the 500 mile claim. It strikes me as obvious that the 500 miles has to do with some combination of network topology and propagation delays, those being approximately the same in every direction.

      Yes, networking does work "that way": farther places take more time to reach than nearer ones, broadly speaking. (Of course, it's faster to reach something 12,000 km away with no packet switch in between than something 50 miles away with switching. That doesn't eliminate the generality.)

      It was also obvious why they didn't report the problem instantly; you cannot instantly know that mail isn't reaching beyond 500 miles without gathering data and correlating to a map, which takes time. Instantly, you can only know data points like "I can't mail to users@example.com". You know that if a stats person gives you a number, it was based on data, and not just a couple of data points. The head of the stats department isn't going to give you a number that isn't factual and backed by science. Of course stats people pride themselves on their data analysis; they are not just going to relay a couple of data points with no analysis attached.
