Comment by jethro_tell
10 hours ago
It's basically what leads to extended downtime almost every time. There are just some things in the stack that are still single points of failure, and when they fail it's a mess.
Yes, I concur.
Sometimes the circular dependencies get almost cartoonishly silly.
Like, "One of the two guys who has the physical keys to the server cage in us-east-1 is on vacation. The other one can't get into his apartment because his smart lock runs into the AWS cloud. So he hires a locksmith, but the locksmith takes an extra two hours to do the job because his reference documents for this model of lock live on an S3 bucket."
I made that example up, but only barely.
We had a pair of machines, and some bright spark set them up to mount each other's NFS shares. After a power outage: "Holy mother of chicken-and-egg NFS hangs, Batman!"
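For the curious, the chicken-and-egg looks roughly like this in fstab terms (hypothetical hostnames and paths, not our actual config):

    # /etc/fstab on host-a: mount host-b's export at boot
    host-b:/export/data  /mnt/b-data  nfs  hard  0 0

    # /etc/fstab on host-b: mount host-a's export at boot
    host-a:/export/data  /mnt/a-data  nfs  hard  0 0

After a site-wide power cut, each box's boot blocks waiting for the other's NFS server to come up, and a "hard" mount (the default) retries forever. Options like "bg", "soft", or "nofail" would have let boot continue and broken the deadlock.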
That was a weird job, but fun. It was a local machine room for a warehouse that originally held the IBM mainframe; it still held its successor, the Multiprise 3000, which has the claim to fame of being the smallest mainframe IBM ever sold. But by then the room was also full of decades of artisanally crafted Unix servers with Pick databases, and the Pick dev team had done most of the system architecture. The best way to understand it is that, for them, Pick is the operating system; Unix is a necessary annoyance they put up with only because nobody has made Pick hardware for 20 years.

And it was NFS mounts everywhere. Somebody had figured out a trick where they could NFS-mount a remote machine and have the local Pick system reach in and scrounge through the remote system's data. But strictly read-only: Pick got grumpy when writing to NFS, to say nothing of how the other database would feel about having its data messed with. Thus the circular mount.
Still, that was not the worst thing I saw. I liked the one system with an SMB mount. "Why is this one SMB?" "Well, Pick complains when you try to write to an NFS mount, but its NFS detection code doesn't trip on SMB mounts." ... Sigh. "Um... I am no Pick expert, but you know why it doesn't like remote mounts, right? SMB does not change that. Do you happen to get a lot of corrupt indexes on this machine?" "Yes, how did you know?"
Oh, yeah, re-exporting NFS mounts via SMB was very much a thing in the early 2000s - something to do with their different approaches to flock() vs fcntl() handling. If you ran into locking issues with NFS, re-exporting via SMB was standard advice.
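A sketch of the divergence (Linux-flavoured, on a local filesystem; exactly how a given NFS or SMB server of the era mapped client locks onto these two mechanisms varied by implementation):

    /* flock() (BSD) and fcntl() (POSIX) locks live in separate
     * namespaces on Linux, so an "exclusive" lock of one kind
     * does not block the other kind at all. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tmp/lockdemo", O_RDWR | O_CREAT, 0644);
        flock(fd, LOCK_EX);              /* parent: exclusive flock() */

        if (fork() == 0) {               /* child: competing locker */
            int fd2 = open("/tmp/lockdemo", O_RDWR);
            struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
            if (fcntl(fd2, F_SETLK, &fl) == 0)   /* granted anyway */
                puts("fcntl write lock granted despite the flock");
            _exit(0);
        }
        wait(NULL);
        return 0;
    }

The two lock families never see each other, so routing an application's locks through a different protocol could make conflicts appear or disappear.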
At some point, the behaviour changed and locks started conflicting. IIRC, we hit it when upgrading to Debian Etch and took the time to unwind the system and make pure NFS work properly for us. Plenty of people took the opposite approach and fiddled with the config to make locking a noop on SMB. I know of at least one web hosting company that ended up having to restore a year's worth of customer uploads from backups as a result...
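That noop-locking fiddle typically looked something like this in smb.conf (a hypothetical reconstruction, not anyone's actual config):

    [uploads]
        path = /srv/uploads
        read only = no
        # Tell Samba to stop doing real byte-range locking...
        locking = no
        # ...and to stop checking locks on reads and writes too.
        strict locking = no

Which "works" right up until two clients write the same file concurrently, and then you're restoring from backups.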
A real example, from Facebook's 2021 outage [1]:
> Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.
There was one (later denied) report that a 'guy with an angle grinder' was involved in gaining access to the server cage.
[1] https://news.ycombinator.com/item?id=28762611
Why would such a critical server even be accessible with only one set of keys?
I’ve always thought mission-critical stuff needs two independent key holders, with keyholes placed far enough apart that one person can't reach both.
They're not actually accessible with 'only one set of keys' in my experience.
You actually have to present your photo ID at the site entry gatehouse, then again to the building entry guard (who will also check that you have a work permit and a site-specific safety induction). Then you swipe a badge at a turnstile to get from reception into the stairwell, swipe your badge at a door to get onto the relevant floor, and swipe your badge and key in a code to enter the room with the cages. Then you use the key.
Other than for certain nuclear missile launches[1], that only happens in the movies.
[1] https://www.nationalmuseum.af.mil/Visit/Museum-Exhibits/Fact...
A circular dependency and a single point of failure are not the same thing. If I have a single point of failure and it is down, I fix it and things work again. If I have a circular dependency, there is no obvious way to fix anything that is broken any longer.