Can an email go 500 miles in 2025?

7 months ago (flak.tedunangst.com)

https://archive.ph/h1AhH

If you’re one of today’s lucky 10,000 and haven’t heard the original 500-mile email story, you can read it at https://news.ycombinator.com/item?id=9338708

Reading the title and knowing exactly what this is about kind of makes me feel old to be honest.

  • If it makes you feel better, I'm so old I read the title and 3/4 of the original story before I realised I'd read it before.

  • I think this is enough of a classic to be widely known even among younger people. I'm 23 (doing a math MSc) and I think all the CS people I know would instantly recognize the 500 miles title.

    Though I do somewhat envy the possibility of having read the article close to publication and feeling, in some sense, part of the history when it crops up again like this.

  • > Reading the title and knowing exactly what this is about kind of makes me feel old to be honest.

    Let's go for experienced and ready to educate the young'uns.

I thought this was about consolidation of email providers so your email never leaves a single datacenter:

"10 years ago we couldn't send an email 500 miles, but these days we can't send it 500 miles because it just routes internally."

Too bad, I think that would have been more interesting to read.

  • This is the first roadblock the author runs into - lots of universities ping at <2 ms, likely because everyone's in the same datacenter.

> There’s a lot to the story that’s obviously made up...

Obviously? I think I've had this phone call myself a few times, although in my experience it was never from a statistician and they didn't give me as much data, but I'm pretty sure the story is mostly accurate.

> I think this is nonsense... why would an invalid or incomplete sendmail configuration default to three milliseconds?

This is a wonderful question, and perhaps much more interesting than anything else on the page, but first, let's reproduce the timing:

My desktop, a 2017 Xeon E7-8880 (144 cores at 2.3 GHz; 1 TB RAM), with a load of 2.26 at this moment:

    $ time sleep 0.001
    real    0m0.004s
    user    0m0.001s
    sys     0m0.003s

On my i9-10900K (3.7 GHz), current load of 3.31:

    $ time sleep 0.001

    real    0m0,002s
    user    0m0,000s
    sys     0m0,001s

(In case you think I'm measuring exec overhead: time /bin/echo returns 0s on both machines.)

Now, as to why this is? Well, in order to understand that, you need to understand how connect() actually works, and how to create a timeout for connect(). Those skilled in the art know you've got a number of choices for how to do it, but they all involve multiple steps, because connect() does not take a timeout as an argument. Here's one way (not too different from what sendmail does/did):

    /* Put the socket in non-blocking mode, start the connect, then poll
       readiness with a zero-timeout select(); if it isn't ready yet, give up. */
    fcntl(f, F_SETFL, O_NONBLOCK);
    if (connect(f, ...) == -1 && errno == EINPROGRESS) {
        fd_set a; FD_ZERO(&a); FD_SET(f, &a);
        struct timeval tv = { .tv_sec = 0, .tv_usec = 0 };
        if (select(f + 1, NULL, &a, NULL, &tv) == 0) {
            close(f); return error;  /* timed out: not connected yet */
        }
    }

If you read this carefully, you only need to ask yourself how much time can pass between the top of connect() and the bottom of select(). If you think the answer is zero, as tedu does, you may be in for the same surprise: computers are not abstract machines; they are made of matter and powered by energy, and thus subject to the laws of physics, so everything takes time.

For others, the surprise might be that it's still 3msec over twenty years later, and I think that is a much more interesting subject to explore than whether the speed of light exists.
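
If you want to see that connect()-to-select() gap directly, instead of inferring it from time sleep, here's a rough sketch of a harness (the 192.0.2.1 address is just a TEST-NET placeholder standing in for a far-away mail server; the number you get will depend entirely on your machine and load):

    /* Hypothetical harness: time the span from entering connect() to
       leaving a zero-timeout select() on a non-blocking socket. */
    #include <stdio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void) {
        struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(25) };
        inet_pton(AF_INET, "192.0.2.1", &sa.sin_addr);  /* placeholder "remote MX" */

        int f = socket(AF_INET, SOCK_STREAM, 0);
        fcntl(f, F_SETFL, O_NONBLOCK);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (connect(f, (struct sockaddr *)&sa, sizeof sa) == -1 && errno == EINPROGRESS) {
            fd_set w; FD_ZERO(&w); FD_SET(f, &w);
            struct timeval tv = { .tv_sec = 0, .tv_usec = 0 };  /* zero timeout, as in the story */
            select(f + 1, NULL, &w, NULL, &tv);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("connect() + zero-timeout select(): %.3f ms\n", ms);
        close(f);
        return 0;
    }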

  • > Obviously? I think I've had this phone call myself a few times, although in my experience it was never from a statistician and they didn't give me as much data, but I'm pretty sure the story is mostly accurate.

    Yeah, the original retelling even states up-front:

    > The story is slightly altered in order to protect the guilty, elide over irrelevant and boring details, and generally make the whole thing more entertaining.

    It's pretty common to alter minor details of stories in order to make them easier to follow, not to mention that the entire account was also written several years after it happened, when details are presumably less likely to be completely accurate. Obviously the dialogue is reconstructed for narrative ease; no reader would look at that and assume it's intended to be a verbatim transcript.

    Unless the author here can cite specific things that make it truly impossible for anything of that shape to have occurred, I'm not seeing anything that justifies the conclusion "there's a lot to the story that's obviously made up".

  • > 144 cores of 2.3ghz; 1tb ram

    I can't help but feel that's somewhat excessive for a desktop. Have you considered closing a few browser tabs?

    • > I can't help but feel that's somewhat excessive for a desktop.

      I got it on ebay for €2k. You can't not expect me to use it as a desktop.

      > Have you considered closing a few browser tabs?

      No? I mean actually no: I made a brotab+wofi script that allows me to search tabs, and I find it a lot more convenient than bookmarks.

      Here's the relevant bits:

          brotab_filter='{
           split($1,A,".");
           t=$2;
           gsub(/&/,  "\\&amp;",t); gsub(/</,  "\\&lt;",t); gsub(/>/,  "\\&gt;",t);
           print "<span size=\"xx-small\">"A[1]"."A[2]"</span><span size=\"xx-small\">."A[3]"</span> <span weight=\"bold\">Firefox</span> <span>"t"</span>"
          }';
      
          ( # more stuff is in here
          brotab list | awk -F" " "$brotab_filter" ) | \
          wofi -m --insensitive --show dmenu --prompt='Focus a window' | sed -e 's/<[^>]*>//g' | {
           read -r id name || exit 1
           case "$id" in
           exec) exec "$name" ;;
           [0-9]*)   swaymsg "[con_id=$id]" focus ;;
           [a-z]\.*)
            brotab activate "$id"; sleep 0.2;
            swaymsg "[title=\"${name#Firefox }\"]" focus
            ;;
           esac
          }
      

      Works fine on 19,294 tabs at the moment...


  • I thought the 3ms was more or less what a low-granularity clock would give you. So, not the clock that gives you nanos, but the bog-standard one that is useful if you just somewhat care that some timer has run out. Perhaps you use it to count frames (120 fps ~ 8.3 ms) or check whether some calendar event has happened.

    A 333 Hz clock seems like something you might have on computers going back to those days, even if not for the CPU.
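
    If timeouts on such a clock get rounded up to one tick (just an assumption on my part, not something I've verified for the systems in the story), then the tick period is what you'd observe, and 333 Hz is the rate that happens to land on 3 ms:

        /* Tick period for a few plausible timer frequencies;
           333 Hz is the hypothetical one that works out to ~3 ms. */
        #include <stdio.h>
        int main(void) {
            int hz[] = { 100, 250, 333, 1000 };
            for (int i = 0; i < 4; i++)
                printf("HZ=%4d -> one tick = %.3f ms\n", hz[i], 1000.0 / hz[i]);
            return 0;
        }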

  • Never got this, honestly.

    Well, first light does 500 miles in 3ms, but the connect signal needs to come back, right? So it should be 250 miles, at most? But this is just a detail.

    More importantly, it seems to assume that all other operations besides the signal actually reaching the destination are instantaneous. As you point out yourself, computers are not abstract machines, so the actual response time between the signal being received by the destination (even assuming it's just one straight line with zero electronics in between) and the destination replying is not zero. I imagine there can be a large variation between physical installations and different types of hardware, so much so as to make it very hard to detect a clear 500-mile boundary.

    Or am I missing something?

    • > Well, first light does 500 miles in 3ms, but the connect signal needs to come back, right? So it should be 250 miles, at most?

      I don't think this is terribly important (NB my examples have nothing to do with networking), but in the author's case it was probably the other way; maybe 10 msec and a bit more: copper gets up to ~0.6c. But I think this detail makes the story less amusing, and is a distraction from wondering why select() takes so long...

      > I imagine there can be a large variation between physical installations and different types of hardware

      There is probably not as much as you think, and Sendmail retries, so with whatever variation exists, only the bounds really matter.

      > Or am I missing something?

      Modern unixish systems have the same log-scale delay coming out of select() so this has almost nothing to do with the hardware being slower or variability in the network.

We have a program where the company that developed it lost the ability to rebuild the app for some reason.

It has a 500 ms timeout to load some settings from a server in the UK via TLS. If it takes more than that 500 ms (or something like that; the exact cause of the timeout is unclear), the app just vapourises.

This is fine in the UK, but TLS needs roughly 3 round trips to complete, so with an RTT above about 160 ms it's screwed.
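
Back-of-the-envelope, that's where the ~160 ms ceiling comes from (a rough sketch; the 3 round trips are an approximation of the TCP plus TLS handshakes, not something we measured):

    /* Hypothetical budget: a hard 500 ms timeout spread across ~3 round trips. */
    #include <stdio.h>
    int main(void) {
        double budget_ms   = 500.0;  /* the app's hard timeout */
        double round_trips = 3.0;    /* rough TCP + TLS handshake cost */
        printf("max tolerable RTT ~= %.0f ms\n", budget_ms / round_trips);  /* ~167 ms */
        return 0;
    }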

Almost all our users are in the UK, Europe, the Middle East, or the east coast of the USA, and within that 160 ms RTT range.

We ran into issues when a dozen people tried to use it in Australia, so the principle still happens with some badly written code.

  • Patching the binary is an option; though it's tricky, I would attempt it anyway.

    • Alas, it's an iOS app, which is in the App Store.

      It was a "nice feature" of an older, unrelated system, and we're going through procurement to replace it; patching a binary is fine for me, but not really a supportable solution at an enterprise level :D

> there was a university president who couldn’t send an email more than 500 miles, and the wise sysadmin said that’s not possible, so the president said come to my office, and lo and behold, the emails stopped before going 500 miles.

NO. NO NO NO.

How can you get SO MANY facts wrong when the freaking story is googlable?

Here's the original email: https://web.mit.edu/jemorris/humor/500-miles

Here's the FAQ that covers the ambiguous parts: https://www.ibiblio.org/harris/500milemail-faq.html

This annoys me because I know the original author and I remember when this happened (he told the story a few times).

Let's recap:

> there was a university president

NO! It was the chairman of the statistics department.

> who couldn’t send an email more than 500 miles,

True. Being in the statistics department he had the tools to make actual maps.

> and the wise sysadmin said that’s not possible, so the president said come to my office

Kind of true. There was an office involved.

> and lo and behold, the emails stopped before going 500 miles.

True.

> There’s a lot to the story that’s obviously made up,

NO! Zero of this story was made up.

ALL the people that were involved in the story are still alive. You can literally get them on the phone and talk to them. We're not debating whether or not Han Solo ever used a light saber. THIS SHIT REALLY HAPPENED.

Sheesh.

  • The whole tone of TFA is super frustrating. The original story is funny, well written, and, well, a story meant to be told "over drinks at a conference". TFA misses all the humor and generally seems to be written by a very bitter person.

  • I think the entire story was made up, but certainly the author admits much of it was made up.

    > 4. If you are not 100% sure of the details, then why are there so many details in the story?

    > Because with the details, the story looks much better. Do you really think that if I started each sentence with the words “I don’t remember exactly, but it seems to be ...”, then something would have changed? In the end, at the very beginning, I warned that some minor details were changed, and some were intentionally omitted - just to make the story better.

    > The second important point is the site where the story was first published. I sent this story to the SAGE (System Administrators Guild) mailing list in the "incredible challenges" section. These were just stories about the most incredible tasks that management sometimes puts to system administrators.

    And let's not forget that the post ends with:

    > I'm looking for work. If you need a SAGE Level IV with 10 years Perl, tool development, training, and architecture experience, please email me at trey@sage.org. I'm willing to relocate for the right opportunity.

    Which I think was the true purpose of the (to me) made up story.

> The poll timeout is 3ms, as specified by the lore. I think this is nonsense, why would an invalid or incomplete sendmail configuration default to three milliseconds?

The answer is that per the original story, it was not defaulting to three milliseconds. It was defaulting to 0, and the 3ms was just how long it took the system to check for a response with a 0 timeout:

> Some experimentation established that on this particular machine with its typical load, a zero timeout would abort a connect call in slightly over three milliseconds.

This is a very different scenario. It's not clear there should be a poll() there at all (or, more likely, a select(), given the age of the story) to match the original; but if there were, the select would have a timeout of 0, not 3 ms, and would simply be unable to distinguish between 0 and anything up to about 3 ms.

  • Yeah, the article is a good one overall, but the truthering is obnoxious, especially since it hinges on a basic misreading of the original story.

    • The original story is also about the statistics department, not the university president. It would be nice to get such details right.

I really wouldn't have predicted the extreme amount of centralization, and arguably unnecessary centralization, that we have today for things like university email and web servers. Even 20 years ago when I was in college, the servers I interacted with, including email, were all in our school's /16. They did have software packages for LMS and stuff, but those were mostly deployed on-prem.

Today the websites are hosted on third party cloud servers (my school's main website is some company that hosts your Wordpress or Drupal site so you don't have to) and the email by Microsoft or Google. Same for every school it seems. I guess the IT department that used to run all the infra is now probably just a few people in charge of ordering new laptops for faculty/staff when they break, and replacing Wi-Fi access points every 5 years.

  • Spam is another reason most places don't bother with self-hosting email now. Big providers like Gmail aggressively filter unknown servers, so if you attempt to host your own and don't set everything up perfectly (or even if you do and you still trip their filter ban threshold), all your email to the largest providers will silently fail to deliver or be blackholed into the Spam folder, and you might never find out or have a way to get them to reconsider.

    • Something I faced at multiple companies before Google and MS took over was that malware would get on your mail server and start blasting out spam. And then you'd find yourself on a bunch of blacklists thinking they made a mistake but it's because your server was actually spamming.


  • You totally could make that prediction just by thinking about the number of schools in the world, the number of /16s in IPv4, and the rate of IPv6 adoption.

    Typically that "IT department" was just a few CS teachers, who assigned some slacking students to create a webpage as homework and to replace bad memory in a server as lab work, and then gave up when that became impossible.

  • Meh. Decades ago most universities realised it made sense to separate "run the IT infrastructure that the university runs on" from the CS department, and after that the university IT department followed the same trajectory as the IT department of every other large institution. It doesn't really make sense to run your own email servers any more if your core business is a paper merchant or steel mill or whatever, and it's the same for universities.

    I'm sure the same thing happened with e.g. electricity - at first people in the physics department ran their own generators, then at some point the university was using enough electricity for day-to-day stuff like lights that the main generators moved to being operated by the facilities department, and nowadays the university just gets their electricity from the local wholesaler like every other big organisation and probably doesn't have a whole lot of transformer expertise in their maintenance department.

I clicked the story wondering if the speed of light has changed since the late 90s.

Apparently not.

Is there a library to re-introduce relevant delays into a CDN so that all users experience their own geographically-appropriate response times?

I mean, I want reliability. But I also want Europeans to be able to taste that authentic latency they'd expect from a fledgling startup running out of a garage in San Jose.

Don't get it

  • Data can only go about 500 miles in 3 ms, and in the original story that's how long the system took to time out, so it would fail to send email any farther than that.
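
    Roughly (a sketch assuming vacuum light speed and ignoring routers and slower propagation in fiber or copper):

        /* Light covers ~186,282 miles per second, i.e. ~186 miles per millisecond. */
        #include <stdio.h>
        int main(void) {
            double c_miles_per_ms = 186.282;
            printf("one way in 3 ms:    %.0f miles\n", 3.0 * c_miles_per_ms);        /* ~559 */
            printf("round trip in 3 ms: %.0f miles\n", 3.0 * c_miles_per_ms / 2.0);  /* ~279 */
            return 0;
        }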

  • It's a nerd story about short timeouts. Effectively, it's about what the speed of light (or of electricity in copper, over real infrastructure) works out to. It's a joke that doesn't make any sense because the 3 ms was clearly bullshit devised for the example. Don't think about it too hard; it doesn't suddenly snap into anything meaningful.

    • Why do you think it's "clearly bullshit"?

      connect() will take time. Either you then fail on receiving EINPROGRESS, or you attempt a select() with 0 for the timeout, which will also take time. That that time could add up to 3 ms on a mid-'90s system also used for other things seems entirely plausible to me.
