Comment by krystofbe
15 hours ago
Looks like a DNSSEC issue, not a nameserver outage. Validating resolvers SERVFAIL on every .de name with EDE:
  RRSIG with malformed signature found for a0d5d1p51kijsevll74k523htmq406bk.de/nsec3 (keytag=33834)

dig +cd amazon.de @8.8.8.8 works, and dig amazon.de @a.nic.de works. The zone data is intact; DENIC just published an RRSIG over an NSEC3 record that doesn't validate against ZSK 33834, so every validating resolver refuses to answer.
Intermittency fits anycast: some [a-n].nic.de instances still serve the previous (good) signatures, so retries occasionally land on a healthy auth. Per DENIC's FAQ the .de ZSK rotates every 5 weeks via pre-publish, so this smells like a botched rollover.
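For anyone reproducing the diagnosis: delv (ships with BIND, assumed available here) walks the validation chain and names the failing RRSIG, and recent dig versions print the EDE alongside the SERVFAIL:

  # fails validation and reports the offending signature:
  delv @8.8.8.8 amazon.de A
  # +cd (checking disabled) skips validation, so this works:
  dig +cd amazon.de @8.8.8.8
  # the authoritatives themselves answer fine; only validation is broken:
  dig amazon.de @a.nic.de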
So a single configuration mistake in a single place wiped out external reachability of a major economy. It happened in the evening local time and should be fixable, modulo cache TTLs, by morning. This will limit the blast radius somewhat.
Still, at this level, brittle infrastructure is a political risk. The internet's famous "routing around damage" isn't quite working here. Should make for an interesting post mortem.
I am reminded of the warning that Zonemaster gives about putting all of a domain's name servers in a single AS, as is common practice for many larger providers. A lot of people do not want others to see this as a problem, since a single AS is a convenient configuration for routing, but it has the downside of being a single point of failure.
Building redundant infrastructure that can withstand BGP and DNS configuration mistakes is not that simple, but it can be done.
As the CPU/RAM resources to run an authoritative-only slave nameserver for a few domains are extremely minimal (mine run at a unix load of 0.01), it's a very wise idea to put your ns3 or something at a totally different service provider on another continent. It costs less than a cup of coffee per month.
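For illustration, the whole job on the secondary is a stanza like this in BIND 9 (zone name and primary address are placeholders; older versions spell it "type slave" and "masters"):

  zone "example.org" {
      type secondary;
      primaries { 192.0.2.1; };
      file "db.example.org";
  };

Allow zone transfers from that box on the primary, publish the extra NS record, and you have cross-provider redundancy.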
3 replies →
On Google Cloud it's always four nameservers like
Would not make any sense to do four of them if it's a single AZ. Also, they are geo-aware and routed to your nearest region.
2 replies →
DNS is a centralization risk, yes. Somehow we've decided this is fine. DNSSEC isn't the only issue - your TLD's nameservers could also be offline, or censored in your country.
DNS is barely centralized. Is there an alternative global name lookup system that is less centralized without even worse downsides?
3 replies →
Normally it should not have been, with caching and all, but that was the past...
Think about what would happen the day Let's Encrypt is broken, for whatever reason, technical or political, like having an unhinged US leader and being located in the wrong country. Take into account the push by Let's Encrypt and the major web browsers to restrict certificate validity to short periods, like only a few days...
9 replies →
Not really? .com and .net are still up
If Let's Encrypt goes down, half of the Internet will become inaccessible in a week.
8 replies →
"The internet's famous "routing around damage" isn't quite working here."
DNS is a look up service that runs on the internet.
Internet routing of IP packets is what the internet does and that is working fine (for a given value of fine).
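Easy enough to demonstrate from a shell (using 8.8.8.8 as a stand-in for any validating resolver):

  # the lookup layer is what's broken:
  dig amazon.de @8.8.8.8    # SERVFAIL
  # packet routing underneath is fine: fetch the auth's address with
  # validation disabled, then reach it directly:
  ping -c1 "$(dig +cd +short a.nic.de @8.8.8.8 | head -n1)"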
You remind me of someone saying "the internet is down" when they really mean "I've forgotten my wifi password".
Us non pod-people caught his drift.
1 reply →
> So a single configuration mistake in a single place wiped out external reachability of a major economy.
The real world beats sci-fi :) And isn't that why we love IT? And hate it too, because of the "people in charge"...
Fail-closed protocols have introduced some brittleness. An HTTP/1.0 server from 1999 can probably still serve visitors today. An HTTPS server from the same year, speaking TLS 1.0, wouldn't.
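You can watch the asymmetry from a shell, assuming such a 1999-era host still exists (old.example.org is a placeholder):

  # the cleartext server still answers an HTTP/1.0 request:
  printf 'GET / HTTP/1.0\r\n\r\n' | nc old.example.org 80
  # a current OpenSSL refuses TLS 1.0 at its default security level,
  # so the handshake fails before the old server even gets a say:
  openssl s_client -connect old.example.org:443 -tls1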
I think I see the point you're making here and I agree.
There is designing something to be fail-closed because it needs to be secure in a physical sense (actually secure, physically protected), and then there's designing something fail-closed because it needs to be secure in an intellectual sense (gatekept, intellectually protected). While most of the internet is "open source" by nature, the complexity has been increased to the point where significant financial and technical investment must be made just to participate. We've let the gatekeepers raise the gates so high that nobody can reach them. AI will let the gatekeepers keep raising the gates, but then even they won't be able to reach the top. Then what?
I think the point you're trying to make, put another way, is that we've compromised a lot of both "availability" and "accessibility" in the name of security since the dawn of the internet. How much of that security actually benefits the internet, and how much of it hinders it? How much of it exists as a gatekeeping measure by those who can afford to write the rules?
Backwards compatibility is unfortunately not something security folk care about.
This is why I still run my blog on HTTP/1.1 only.
2 replies →
You're not wrong, but objecting to fail-closed in a security-sensitive context is entirely missing the point.
>So a single configuration mistake in a single place wiped out external reachability of a major economy.
And fuck nothing at all happened as a result.
Prove it? I’m sure many lifespans were lost to stress
1 reply →
There is the KRITIS law (critical infrastructure law), which tries to enforce some standards to make things less brittle.
I have a bad feeling that the impact will be quite severe for some services, as monitoring, performance, and security services might get disrupted, and just cleaning up is a big mess. Worst case, some OT (operational technology) will experience outages and/or damage. But maybe I am just overestimating the severity of this.
It looks like a failed key replacement during a scheduled maintenance event. Normally this sort of thing is thoroughly tested, with multiple sets of eyes on the detailed review and planning before changes get committed, but obviously something got missed.
Would be interesting to know how something like that could get missed. You'd think the system was set up so that new keys could not be published without being verified as working in a staging system.
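Even an offline validator pass over the signed zone before publication should catch a bad RRSIG, e.g. (illustrative only; no idea what DENIC's actual pipeline looks like):

  # with BIND's tooling:
  dnssec-verify -o de de.zone
  # or with ldns:
  ldns-verify-zone de.zone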
> The internet's famous "routing around damage"
...is only for Pentagon networks and military stuff. It's not for us normal people. (We get Cloudflare and FAANG bullshit instead.)
This is actually startlingly true.
Every FAANG company has its own fiber backbone. Why invest in the internet that everyone uses when you can invest in your own private internet and then sell that instead?
2 replies →
I love how I've worked in IT for 20 years and yet don't understand a single acronym here other than DNSSEC.
I've been in IT 30+ years, been running DNS, web servers, etc. since at least 1994. I haven't bothered with DNSSEC due to perceived operational complexity. The penalty for a screw up, a total outage, just doesn't seem worth the security it provides.
That was my experience too, until I decided that having run email systems for 30-odd years, which HN says is unnatural, piqued my weird or something!
I ran up three new VMs on three different sites. I linked all three systems via a private Wireguard mesh. MariaDB on each VM bound to the wg IP and stock replication from the "primary". PowerDNS runs across that lot. One of the VMs is not available from the internet and has no identity within the DNS. The idea is that if the Eye of Sauron bears down on me, I can bring another DNS server online quite quickly and fiddle the records to bring it online. It also serves as a third authority for replication.
I also deployed https://github.com/PowerDNS-Admin/PowerDNS-Admin which is getting on a bit and will be replaced eventually but works beautifully.
Now I have DNS with DNSSEC and dynamic DNS and all the rest. From memory, starting to sign a zone is roughly this, and PowerDNS will look after everything else:
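  pdnsutil secure-zone example.org    # generate keys and sign the zone
  pdnsutil rectify-zone example.org   # fix up NSEC/NSEC3 ordering metadata
  pdnsutil show-zone example.org      # shows the DS records for your registrar

(example.org stands in for your zone; details vary a bit by PowerDNS version.)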
Grab a test zone and work it all out first; it won't cost you much, and then go for "production".
My home systems are DNSSEC signed.
How simple sysadmin was in 1994 with no cryptography on any protocol. Everything could be easily MITM'd. Your credit card number would get jacked left and right in the 90s.
5 replies →
To be fair, advanced real-world knowledge of public/private key PKIs (X.509 or other) and things like root CAs is a fairly esoteric and very specialized field of study. There are people whose regular day jobs are nothing but PKI infrastructure, and their depth of knowledge on many non-PKI subjects is probably surface level only.
I know quite a bit about PKI and X.509, and I can tell you this much: the overlap with how DNSSEC works is limited.
1 reply →
Is that actually true, though? Even though it's not really my job, I find myself debugging certificates and keys at least once a month, and that's after automating as much as possible with certbot and cloud certificates. PKI always seems to demand attention.
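A typical month of it, for the curious (standard openssl invocations; file names and host are placeholders):

  # when does this cert expire, and for which names?
  openssl x509 -in fullchain.pem -noout -dates -subject
  # what is the server actually presenting right now?
  openssl s_client -connect example.com:443 -servername example.com \
    </dev/null 2>/dev/null | openssl x509 -noout -issuer -dates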
1 reply →
It's not made easier by the fact that a lot of cryptography is either very old and arcane, or one hell of a mess of code that doesn't make sense without reading the standards.
I had the misfortune of having to dig deep into constructing ASN.1 payloads by hand [1] because that's the only thing Java speaks, and oh holy hell is this A MESS, because OF COURSE there are two ways to encode a bunch of bytes (BIT STRING vs OCTET STRING) and encoding ed25519 keys uses BOTH [2] (quick openssl demo below the links).
And ed25519 is a mess in itself. The more-or-less standard implementation by orlp [3] is almost completely lacking any comments explaining what is going on where, and reading the relevant RFCs alone doesn't help; it's probably only understandable after reading a 500-page math paper.
It's almost as if cryptographers have zero interest in letting interested random people join the field.
End of rant.
[1] https://github.com/msmuenchen/meshcore-packets-java/blob/mai...
[2] https://datatracker.ietf.org/doc/html/rfc8410#appendix-A
[3] https://github.com/orlp/ed25519/tree/master
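If you want to see the BIT STRING vs OCTET STRING split from [2] without writing any code, stock openssl (1.1.1 or later) shows it:

  openssl genpkey -algorithm ed25519 -out key.pem
  openssl asn1parse -in key.pem                         # private key: OCTET STRING
  openssl pkey -in key.pem -pubout | openssl asn1parse  # public key: BIT STRING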
16 replies →
Don't worry, that's by design ;)