I found the most interesting part of the NIST outage post [1] to be NIST's special Time Over Fiber (TOF) program [2], which "provides high-precision time transfer by other service arrangements; some direct fiber-optic links were affected and users will be contacted separately."
I've never heard of this! Very cool service, presumably for … quant / HFT / finance firms (maybe for compliance with FINRA Rule 4590 [3])? Telecom providers synchronizing 5G clocks for time-division duplexing [4]? Google/hyperscalers as input to Spanner or other global databases?
Seriously fascinating to me -- who would be a commercial consumer of NIST TOF?
[1] https://groups.google.com/a/list.nist.gov/g/internet-time-se...
[2] https://www.nist.gov/pml/time-and-frequency-division/time-se...
[3] https://www.finra.org/rules-guidance/rulebooks/finra-rules/4...
[4] https://www.ericsson.com/en/blog/2019/8/what-you-need-to-kno...
I never saw a need for this in HFT. In my experience, GPS was used instead, but there was never any critical need for microsecond accuracy in live systems. Sub-microsecond latency, yes, but when that mattered it was in order to do something as soon as possible rather than as close as possible to Wall Clock Time X.
Still useful for post-trade analysis; perhaps you can determine that a competitor now has a faster connection than you.
The regulatory requirement you linked (and other typical requirements from regulators) allows a tolerance of one second, so it doesn't call for this kind of technology.
> I never saw a need for this in HFT. In my experience, GPS was used instead, but there was never any critical need for microsecond accuracy in live systems.
MiFID II (UK/EU) minimum is 1 µs granularity
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:...
My guess would be scientific experiments where they need to correlate or sequence data over large regions. Things like correlating gravitational waves with radio signals and gamma ray bursts.
Those are GPS-based too. You typically would have a circuit you trained off 1PPS and hopefully had 10 or so satellites in view.
You can get 50 ns with this. Of course, you would verify at NIST.
> a commercial consumer
Where does it say these are commercial consumers?
https://en.wikipedia.org/wiki/Schriever_Space_Force_Base#Rol...
> Building 400 at Schriever SFB is the main control point for the Global Positioning System (GPS).
> I've never heard of this! Very cool service, presumably for … quant / HFT / finance firms (maybe for compliance with FINRA Rule 4590 [3])?
To start with, probably for scientific stuff, à la:
* https://en.wikipedia.org/wiki/White_Rabbit_Project
But fibre-based time is important in case of GNSS time signal loss:
* https://www.gpsworld.com/china-finishing-high-precision-grou...
SIGINT: as a source clock for others in a network doing super-accurate TDOA, for example.
But they do not need absolute time, and internal rubidium clocks can keep the required accuracy for a few days. After that, sync can be transferred with a portable plug, which is completely viable in tactical/operational level EW systems.
Science equipment: distributed radio telescopes, where you need to precisely align data received at different locations.
I think Google uses chrony instead of NTP
Google doesn't use chrony specifically, just an algorithm that is somewhat chrony-like (but very different in other ways). It's called Google TrueTime.
I'm sure all of that is true, but so is "Department of Defense".
They're also the largest holder of IPv4 space, still. https://bgp.he.net/report/peers#_ipv4addresses
Why does the DoD hold so many IPv4s?
> Google/hyperscalers as input to Spanner or other global databases?
Think Google might have rolled their own clock sources and corrections.
Ex: Sundial, https://www.usenix.org/conference/osdi20/presentation/li-yul... / https://storage.googleapis.com/gweb-research2023-media/pubto... (pdf)
Only Boulder servers lost sync.
To say NIST was off is clickbait hyperbole.
This page: https://tf.nist.gov/tf-cgi/servers.cgi shows that NIST has > 16 NTP servers on IPv4, of those, 5 are in Boulder and were affected by the power failure. The rest were fine.
However, most entities should not be using these top-level servers anyway, so this should have been a problem for exactly nobody.
IMHO, most applications should use pool.ntp.org
Who does use those top-level servers? Aren’t some of them propagating the error or are all secondary level servers configured to use dispersed top-level servers? And how do they decide who is right when they don’t match?
Is pool.ntp.org dispersed across possible interference and error correlation?
You can look at who the "Stratum 2" servers are, in the NTP.org pool and otherwise. Those are servers that sync from Stratum 1, like NIST.
Anyone can join the NTP.org pool so it's hard to make blanket statements about it. I believe there's some monitoring of servers in the pool but I don't know the details.
For example, Ubuntu systems point to their Stratum 2 timeservers by default, and I'd have to imagine that NIST is probably one of their upstreams.
An NTP server usually has multiple upstream sources and can steer its clock to minimize the error across multiple servers, as well as detect misbehaving servers and reject them ("falsetickers"). Different NTP server implementations might do this a bit differently.
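For the curious, the core of that falseticker rejection is an interval-intersection idea (Marzullo's algorithm, refined in RFC 5905): each server reports an offset plus an error bound, and servers whose intervals miss the majority overlap region get discarded. A toy Python sketch, with invented names and sample numbers, just to show the flavor:

    # Toy sketch of NTP-style falseticker rejection via interval
    # intersection: find the point covered by the most intervals,
    # then keep only the sources whose intervals contain it.
    def select_truechimers(sources):
        # sources: list of (offset_seconds, error_bound_seconds)
        edges = []
        for offset, err in sources:
            edges.append((offset - err, +1))   # interval opens
            edges.append((offset + err, -1))   # interval closes
        edges.sort()
        best = count = 0
        point = None
        for x, delta in edges:
            count += delta
            if count > best:
                best, point = count, x
        if best <= len(sources) // 2:
            return []                          # no majority agreement
        return [(o, e) for o, e in sources if o - e <= point <= o + e]

    # Three servers agree near +2 ms; one falseticker claims -4 s.
    print(select_truechimers([(0.0020, 0.001), (0.0025, 0.001),
                              (0.0018, 0.001), (-4.0, 0.001)]))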
From my own experience managing large numbers of routers, and troubleshooting issues, I will never use pool.ntp.org again. I’ve seen unresponsive servers as well as incorrect time by hours or days. It’s pure luck to get a good result.
Instead I’ll stick to a major operator like Google/Microsoft/Apple, which have NTP systems designed to handle the scale of all the devices they sell, and are well maintained.
I believe if you use time.nist.gov it round-robins DNS requests, so there’s a chance you’d have connected to the Boulder server. So some people would have experienced NIST being 5 μs off.
Nitpick: UTC stands for Coordinated Universal Time. The ordering of the letters was chosen to not match the English or the French names so neither language got preference.
Universal Time, Coordinated.
This is how I say it in my head.
That doesn't quite match what the wikipedia page says:
> The official abbreviation for Coordinated Universal Time is UTC. This abbreviation comes as a result of the International Telecommunication Union and the International Astronomical Union wanting to use the same abbreviation in all languages. The compromise that emerged was UTC, which conforms to the pattern for the abbreviations of the variants of Universal Time (UT0, UT1, UT2, UT1R, etc.).
Follow the citation: https://www.nist.gov/pml/time-and-frequency-division/how-utc...
> ... in English the abbreviation for coordinated universal time would be CUT, while in French the abbreviation for "temps universel coordonné" would be TUC. To avoid appearing to favor any particular language, the abbreviation UTC was selected.
It's also the time in Iceland, conveniently
That's an interesting rationale.
Reminds me of when a group is divided into two parts, dubbed group 1 and group A, such that neither feels secondary.
The trick is making sure no one is happy with the final outcome
It also stands for Universel Temps Coordonné.
It's le temps universel coordonné in French.
Not exactly the topic of discussion, but also not not on topic: just wanted to sing the praises of chrony, which has performed better than the traditional OS-native NTP clients in our testing on a myriad of real and virtualized hardware.
Chrony is the default already in some distros (RHEL and SLES that I know of), probably for this very reason.
I'm missing the nuance, or perhaps the difference, between the first scenario, where sending inaccurate time was worse than sending no time, and the present one, where they are sending inaccurate time. Sorry if it's obvious.
The 5 µs inaccuracy is basically irrelevant to NTP users. From the second update to the Internet Time Service mailing list [1]:
To put a deviation of a few microseconds in context, the NIST time scale usually performs about five thousand times better than this at the nanosecond scale by composing a special statistical average of many clocks. Such precision is important for scientific applications, telecommunications, critical infrastructure, and integrity monitoring of positioning systems. But this precision is not achievable with time transfer over the public Internet; uncertainties on the order of 1 millisecond (one thousandth of one second) are more typical due to asymmetry and fluctuations in packet delay.
[1] https://groups.google.com/a/list.nist.gov/g/internet-time-se...
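The "special statistical average of many clocks" is, very loosely, an inverse-variance weighted ensemble: the more stable a member clock, the more say it gets. The real NIST time scale does far more (frequency prediction, filtering, steering), so treat this as a flavor-only sketch with made-up names:

    def ensemble_offset(clocks):
        # clocks: list of (offset_seconds, variance) per member clock.
        # Weight each clock by the inverse of its variance so the most
        # stable members dominate the ensemble average.
        weights = [1.0 / var for _, var in clocks]
        total = sum(weights)
        return sum(off * w for (off, _), w in zip(clocks, weights)) / total

    # Two quiet clocks near zero outweigh one noisy outlier:
    print(ensemble_offset([(1e-9, 1.0), (-2e-9, 1.0), (50e-9, 100.0)]))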
> Such precision is important for scientific applications, telecommunications, critical infrastructure, and integrity monitoring of positioning systems. But this precision is not achievable with time transfer over the public Internet
How do those other applications obtain the precise value they need without encountering the Internet issue?
It's a good question, and I wondered the same. I don't know, but I'd postulate:
As it stands at the minute, the clocks are a mere 5 microseconds out and will slowly get better over time. This isn't even in the error measurement range and so they know it's not going to have a major effect on anything.
When the event started and they lost power and access to the site, they also lost their management access to the clocks. At this point they don't know how wrong the clocks are, or how much more wrong they're going to get.
If someone restores power to the campus, the clocks are going to be online (all the switches and routers connecting them to the internet suddenly boot up) before they've had a chance to get admin control back. If something happened while they were offline and the clocks drifted significantly, then when they came online half the world might decide to believe them and suddenly step-change to follow them. This could cause absolute havoc.
Potentially safer to scram something than have it come back online in an unknown state, especially if (lots of) other things are going to react to it.
In the last NIST post, someone linked to The Time Rift of 2100: How We lost the Future --- and Gained the Past. It's a short story that highlights some of the dangers of fractured time in a world that uses high precision timing to let things talk to each other: https://tech.slashdot.org/comments.pl?sid=7132077&cid=493082...
> […] where sending inaccurate time was worse than sending no time […]
When you ask a question, it is sometimes better to not get an answer, and know you have not gotten an answer, than to get the wrong answer. If you know that a 'bad' situation has arisen, you can start contingency measures to deal with it.
If you have a fire alarm: would you rather have it fail in such a way that it gives no answer, or fail in a way where it says "things are okay" even if it doesn't know?
Out of curiosity, can anyone say the most impactful things they've needed incredibly accurate time for?
I work at a particle accelerator. We use White Rabbit (https://white-rabbit.web.cern.ch/) to synchronize some very sensitive devices, mostly the RF power systems and related data acquisition systems, down to nanosecond accuracy.
Spanner
(See https://docs.cloud.google.com/spanner/docs/true-time-externa...)
Does it need to be this close to NIST, or just relative to each other? Because the latter one is solved by PTP.
Spacecraft state vectors.
Not sure, but synthetic-aperture radio telescopes would need to sync their local clocks.
I defer to the experts.
As far as I'm aware they just timestamp the sample streams based on a local GPS-backed atomic reference. Then when they get the data/tapes into one computing center they can run a more sophisticated correlation entirely in software to smooth things out.
As a very coarse number, 5 µs is 1500 meters of radio travel.
If (and it isn't very likely) GPS satellites were to get 5 µs out of whack, we would be back to Loran-C levels of accuracy for navigation.
GPS
Telling people at the bar that I have an atomic clock at home.
cesium or rubidium?
From NPR (22 points) https://news.ycombinator.com/item?id=46351105
Maybe I missed something, but I don't quite understand the video title "NIST's NTP clock was microseconds from disaster". Is there some limit of drift before it's unrecoverable? Can't they just pull the correct time from the other campus if it gets too far off?
You'll never guess why. The answer might shock you.
I'll consider it clickbait...
... unless someone with real experience needing those tolerances chimes in and explains why it's true.
I'd have thought Jeff to be above clickbait, but here we are
I find this topic and thread fascinating.
I took too much Adderall today.
I know some HFT people who made a few hundred K off of this.
Can you elaborate on how? Did they consciously exploit this or their systems just had a lucky glitch?
Has anyone here ever needed microsecond precision? Would love to hear about it.
I worked at Altera (FPGA supplier) as the Ethernet IP apps engineer for Europe for a few years. All the big telecoms (Nokia, Ericsson, Cisco, etc.) use Precision Time Protocol (PTP) in some capacity, and all required clocks to be accurate at the nanosecond level, sometimes as low as 10 ns at the boundary. Any imperfection in the local clock converts directly into timestamp error, and timestamp error is what limits PTP synchronization performance. Timestamps are the fundamental observable in PTP. Quantization and jitter create irreducible timestamp noise. That noise directly limits offset and delay estimation. Errors accumulate across network elements, so internal clock error must be much smaller than the system requirement.
I think most people would look at the error and think "what's the big deal", but all the telecom customers would be scrambling to find a clock that hadn't fallen out of sync.
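Concretely, the offset/delay estimation mentioned above is just the standard two-way exchange arithmetic (same math in PTP and NTP), so any timestamp noise feeds straight into the result. A minimal sketch, with invented example numbers:

    def ptp_offset_delay(t1, t2, t3, t4):
        # t1: Sync departs master      (master clock)
        # t2: Sync arrives at slave    (slave clock)
        # t3: Delay_Req departs slave  (slave clock)
        # t4: Delay_Req arrives master (master clock)
        # Assumes symmetric path delay; any asymmetry or timestamp
        # jitter lands directly in the offset estimate.
        offset = ((t2 - t1) - (t4 - t3)) / 2.0
        delay = ((t2 - t1) + (t4 - t3)) / 2.0
        return offset, delay

    # 20 ns of timestamp error on t2 shifts the offset estimate by 10 ns:
    print(ptp_offset_delay(0.0, 100e-9, 200e-9, 300e-9))  # (0.0, 1e-07)
    print(ptp_offset_delay(0.0, 120e-9, 200e-9, 300e-9))  # (1e-08, 1.1e-07)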
We don't use NTP, but for robotics stereo camera synchronization we often want the two frames to be within ~10 µs of each other. For sensor fusion we then also need a lidar on PTP time to be translated to the same clock domain as the cameras, for which we also need <~10 µs.
We actually disable NTP entirely (run it once per day or at boot) to avoid clocks jumping while recording data.
> We actually disable NTP entirely (run it once per day or at boot) to avoid clocks jumping while recording data.
This doesn't seem right to me. NTP with default settings should be monotonic, so no jumps. If you disable it, Linux enters 11-minute mode, IIRC, and that may not be monotonic.
For a low-precision environment, to avoid sudden jumps, I used SetSystemTimeAdjustment on Windows (now SetSystemTimeAdjustmentPrecise) to smoothly steer the system clock to match the GPS-supplied time signal.
On Linux I think the adjtimex() system call does the equivalent https://manpages.ubuntu.com/manpages/trusty/man2/adjtimex.2....
It smears out time differences which is great for some situations and less ideal for others.
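A back-of-the-envelope model of why slewing avoids jumps: the clock runs slightly fast or slow until the offset is absorbed, so timestamps stay monotonic. Toy arithmetic with an invented function (500 ppm is the classic max slew rate of adjtime-style interfaces):

    # Absorb an offset by rate adjustment (slewing) instead of stepping.
    def slew_duration(offset_s, rate_ppm=500.0):
        # At 500 ppm the clock runs 0.05% fast/slow; timestamps taken
        # during the correction never go backwards.
        return abs(offset_s) / (rate_ppm * 1e-6)

    print(slew_duration(0.005))  # a 5 ms correction takes 10 s at 500 ppm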
In your stereo camera example, are these like USB webcams or something like MIPI CSI attached devices?
We use nanosecond precision driven by GPS clocks. That timestamp in conjunction with star tracker systems gives us reliable positioning information for orbital entities.
https://en.wikipedia.org/wiki/Star_tracker
(Assuming "precision" really meant "accuracy") The network equipment I work on requires sub microsecond time sync on the network for 5G providers and financial trading customers. Ideally they'd just get it from GPS direct, but that can be difficult to do for a rack full of servers. Most of the other PTP use cases I work with seem to be fine with multiples of microseconds, e.g. Audio/Video over the network or factory floor things like PLCs tend to be find with a few us over the network.
Perhaps a bit more boring than one might assume :).
Lightning detection. You have a couple of ground stations with known positions that wait for certain electromagnetic pulses, and which record the timestamps of such pulses. With enough stations you can triangulate the location of the source of each pulse. Also a great way to detect nuclear detonations.
There is a German club that builds and distributes such stations (using GPS for location and timing), with quite impressive global coverage by now:
https://www.blitzortung.org
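For anyone curious how the triangulation works: with synchronized timestamps from several stations you can solve for the source position (and emission time) by least squares, and 1 µs of timestamp error corresponds to roughly 300 m of position error. A toy sketch on synthetic data, with invented names and scipy assumed:

    import numpy as np
    from scipy.optimize import least_squares

    C = 299_792_458.0  # propagation speed, m/s

    def locate_strike(stations, arrival_times):
        # stations: (x, y) positions in meters; arrival_times in seconds.
        # Solve for source position and emission time t0 by least squares.
        stations = np.asarray(stations, dtype=float)
        t = np.asarray(arrival_times, dtype=float)

        def residuals(p):
            x, y, t0 = p
            dist = np.hypot(stations[:, 0] - x, stations[:, 1] - y)
            return (t0 + dist / C) - t  # predicted minus observed arrivals

        guess = [*stations.mean(axis=0), t.min()]
        return least_squares(residuals, guess).x

    # Synthetic example: strike at (10 km, 20 km), four stations.
    stations = [(0, 0), (50_000, 0), (0, 50_000), (50_000, 50_000)]
    true_pos = np.array([10_000.0, 20_000.0])
    times = [np.hypot(*(true_pos - s)) / C for s in stations]
    print(locate_strike(stations, times))  # ~ [10000, 20000, 0]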
At a previous role, we needed nanosecond precision for a simulcast radio communications system. This was to allow for wider transmission for public safety radio systems without having to configure trunking. We could even adjust the delay in nanoseconds to move the deadzones away from inhabited areas.
We solved this by having GPS clocks at each tower as well as having the app servers NTP with each other. The latter burned me once due to some very dumb ARP stuff, but that's a story for another day.
We need nanosecond precision for trading - basically timestamping exchange/own/other events and to measure latency.
You probably want to ask about accuracy. Any random microcontroller from the 90s needs microsecond precision.
How do you even get usable microsecond precision sync info from a server thousands of kilometers away? The latency is variable so the information you get can't be verified / will be stale the moment it arrives. I'm quite ignorant on the topic.
High-speed finance is msec and below. The fastest publicly known tick-to-trade is just shy of 14 nanos.
Timekeeping starts to become really hard, often requiring specialized hardware and protocols.
The high frequency trading guys
edit: also the linked slides in TFA
Yes, but always got it from GPS so presumably they'd be off about the same amount.
Distributed sonar, allows placing receivers willy-nilly and aligning the samples later.
Remote microphone switching - though for this you wouldn't notice 5us jitter, it's just that the system we designed happened to have granularity that good.
Your speakers do, so that people's voices match their mouth movements. The speaker clocks need to be in sync with the CPU clocks, and they operate at different frequencies.
When we collected, correlated, and measured all controlling messages in a whole 4G network. Millisecond precision meant guaranteed out-of-order message flows.
Nuclear measurements, where the speed of a gamma ray flying across a room vs a neutron is relevant. But that requires at least nanosecond time resolution, and you’re a long way from thinking about NTP.
I believe LTE and 5G networks require it to coordinate timeslots between overlapping cells. Of course, they can use whatever reference they want, as long as all the cells are using the same one - it doesn't have to be UTC. Some (parts of) networks transmit it across the network, while others have independent GPS receivers at each cell site.
Synchronization is also required for SDH networks. Don't know if those are still used.
Someone else referenced low power ham radio modes like WSPR, which I also don't know much about, but I can imagine they have timeslots linked to UTC and require accuracy. Those modes have extremely low data rates and narrow bandwidths, requiring accurate synchronization. I don't know if they're designed to self-synchronize, or need an external reference.
When multiple transmitters are transmitting the same radio signal (e.g. TV) they might need to be synchronized to a certain phase relationship. Again, I don't know much about it.
WSPR doesn't require tight synchronization, but it does require pretty stable frequency sources over periods of tens of seconds.
It is very common to integrate a GPS receiver in a WSPR beacon to discipline the transmit frequency, but with modest thermal management, very ordinary crystal oscillators have very nice stability.
A database like Google Spanner has higher latency in proportion to the uncertainty about the time. Driving the time uncertainty down into the microsecond range, or lower, keeps latency low.
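The mechanism, as described in the Spanner paper: the clock API returns an interval [earliest, latest] rather than a point, and a commit is held back ("commit wait") until its timestamp is guaranteed to be in the past. A toy sketch of the idea; the class and numbers are invented for illustration:

    import time

    class TrueTimeSketch:
        # Toy TrueTime-style interval clock: now() returns bounds on
        # the real time given an uncertainty of +/- epsilon seconds.
        def __init__(self, epsilon_s):
            self.eps = epsilon_s

        def now(self):
            t = time.time()
            return t - self.eps, t + self.eps

        def commit_wait(self, commit_ts):
            # Don't release a commit until its timestamp is definitely
            # in the past, i.e. earlier than now().earliest.
            while self.now()[0] <= commit_ts:
                time.sleep(max(self.eps / 4, 1e-6))

    tt = TrueTimeSketch(epsilon_s=0.001)
    ts = tt.now()[1]            # stamp the commit with the latest bound
    start = time.time()
    tt.commit_wait(ts)
    print(f"waited {time.time() - start:.4f}s")  # ~2 * epsilon

Halving the uncertainty halves the wait, which is the latency link the parent comment describes.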
If you want sample accurate audio transmission you're going to want resolution on the order of 10s of microseconds.
Telecom.
Precision Time Protocol gets you sub-microsecond.
https://en.wikipedia.org/wiki/Precision_Time_Protocol
Lots of things do. Shoot, even plain old TDM needs timing precision on the order of picoseconds to nanoseconds.
But does that require accurate UTC?
I mean, we routinely benchmark things that take microseconds or less. I've seen a 300 picosecond microbenchmark (single cycle at 3GHz). No requirement that absolute time is correct, though.
Are there any plans being made to prevent this happening in the future?
Yes, the US government is banning all those democrat windmills that conspired to blow over the NTP server.
It would greatly decrease electricity prices too. God knows how expensive it is to power those gigantic spinning things
??? The power outage was voluntary and surrounding towns chose not to turn off power. They could absolutely make infrastructure changes, and I’m sure the backups for power could make changes too.
Now I'm curious... How the hell do you synchronize clocks to such an extreme accuracy? Anybody have a good resource before I try to find one myself?
Look up PTP White Rabbit.
Thank you!
More discussion:
NTP at NIST Boulder Has Lost Power
https://news.ycombinator.com/item?id=46334299
Gah, just when you think you can trust time.nist.gov
Suggestions from the community for more reliable alternatives?
> Gah, just when you think you can trust time.nist.gov
You still can...
If you're that concerned about 5 microseconds: build your own Stratum 1 time server https://github.com/geerlingguy/time-pi
or just use ntppool https://www.ntppool.org/en/
It sounds like GPS, and thus a GPS-based stratum 1 server, uses these time servers, but they were successfully failed over:
> Jeff finished off the email mentioning the US GPS system failed over successfully to the WWV-Ft. Collins campus. So again, for almost everyone, there was zero issue, and the redundancy designed into the system worked like it's supposed to.
So failures in these systems are potentially correlated.
The author mentions another solution. Apparently he runs his own atomic clock. I didn’t know this was a thing an individual could do.
> But even with multiple time sources, some places need more. I have two Rubidium atomic clocks in my studio, including the one inside a fancy GPS Disciplined Oscillator (GPSDO). That's good for holdover. Even if someone were jamming my signal, or my GPS antenna broke, I could keep my time accurate to nanoseconds for a while, and milliseconds for months. That'd be good enough for me.
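Rough holdover arithmetic (my numbers, not the author's): accumulated time error is about fractional frequency error × elapsed time, so the "nanoseconds for a while, milliseconds for months" claim scales roughly like this:

    # Holdover error ~= fractional frequency offset * elapsed seconds
    # (ignoring aging and temperature drift, which make it worse).
    def holdover_error_s(frac_freq, elapsed_s):
        return frac_freq * elapsed_s

    day, month = 86_400, 30 * 86_400
    print(holdover_error_s(1e-11, day))    # ~0.9 µs/day at 1e-11
    print(holdover_error_s(1e-9, month))   # ~2.6 ms/month at 1e-9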
Be aware that there are members of the NTP pool with less-than-honorable intentions and you don't get to pick-and-choose. Yes, they all should provide the time, but they also get your IP address.
For example: unlike the IPv4 space, the IPv6 space is too big to scan, so a number of "researchers" (if you want to call them that) put v6-capable NTP servers in the NTP pool to gather information about active v6 blocks to scan/target.
Most places that need accurate time get it from GPS. That is 10-100 ns.
Also, you can use multiple NIST servers. They have ones in Fort Collins, CO and Gaithersburg, MD. Most places shouldn't use NIST directly but other Stratum 1 time servers.
Finally, NTP isn't accurate enough (10-100 ms) for a microsecond error to matter.
Their handling it responsibly seems like more evidence for trusting them, not less?
Yes.
Use NTP with ≥4 diverse time sources, just as RFC 5905 suggests doing. And use GPS.
(If you're reliant upon only one source of a thing, and that thing is important to you in some valuable way, then you're doing it wrong. In other words: Backups, backups, backups.)
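A crude illustration of the ≥4-diverse-sources advice: query several servers with single-shot SNTP and take the median offset, so one bad server can't drag the estimate. This is a sketch, not a real client (no falseticker logic, no poll discipline), and the hostnames are just examples:

    import socket
    import struct
    import time

    NTP_EPOCH_DELTA = 2_208_988_800  # seconds from 1900 to the Unix epoch

    def sntp_offset(host, timeout=2.0):
        # Minimal RFC 4330 client packet: LI=0, VN=4, Mode=3 (client).
        pkt = b"\x23" + 47 * b"\x00"
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            t1 = time.time()
            s.sendto(pkt, (host, 123))
            data, _ = s.recvfrom(48)
            t4 = time.time()
        secs, frac = struct.unpack("!II", data[40:48])  # transmit timestamp
        server = secs - NTP_EPOCH_DELTA + frac / 2**32
        return server - (t1 + t4) / 2  # assumes symmetric path delay

    hosts = ["time.nist.gov", "time.google.com",
             "time.cloudflare.com", "pool.ntp.org"]
    offsets = sorted(sntp_offset(h) for h in hosts)
    print("median offset:", offsets[len(offsets) // 2])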
Use the other servers as well: https://tf.nist.gov/tf-cgi/servers.cgi
For instance, time-a-wwv.nist.gov.
One should configure a number of different NTP sources instead of just a single host.
I'm more concerned about what you think they did to earn your trust in the first place