
Comment by steve1977

15 hours ago

I don't find the wording in the RFC to be that ambiguous actually.

> The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.

The "possibly preface" (sic!) to me is obviously to be understood as "if there are any CNAME RRs, the answer to the query is to be prefaced by those CNAME RRs" and not "you can preface the query with the CNAME RRs or you can place them wherever you want".

I agree this doesn't seem too ambiguous - it's "you may do this.." and they said "or we may do the reverse". If I say you could prefix something.. the alternative isn't that you can suffix it.

But also.. the programmers working on the software running one of the most important (end-user) DNS servers in the world:

1. Changes logic in how CNAME responses are formed

2. I assume some tests at least broke that meant they needed to be "fixed up" (y'know - "when a CNAME is queried, I expect this response")

3. No one saw these changes in test behavior and thought "I wonder if this order is important". Or "We should research more into this", or "Are other DNS servers changing order", or "This should be flagged for a very gradual release".

4. Ends up in test environment for, what, a month.. nothing using getaddrinfo from glibc is being used to test this environment or anyone noticed that it was broken

Cloudflare seem to be getting into the swing of breaking things and then being transparent. But this really reads as a fun "did you know", not a "we broke things again - please still use us".

There's no real RCA except to blame an RFC - but honestly, for a large-scale operation like theirs, this seems very big to slip through the cracks.

I would make a joke about South Park's oil "I'm sorry".. but they don't even seem to be

  • > 4. Ends up in test environment for, what, a month.. nothing using getaddrinfo from glibc is being used to test this environment or anyone noticed that it was broken

    "Testing environment" sounds to me like a real network that real user devices are used with (like the network used inside Cloudflare offices). That's what I would do if I was developing a DNS server anyway, other than unit tests (which obviously wouldn't catch this unless they were explicitly written for this case) and maybe integration/end-to-end tests, which might be running in Alpine Linux containers and as such using musl. If that's indeed the case, I can easily imagine how no one noticed anything was broken. First look at this line:

    > Most DNS clients don’t have this issue. For example, systemd-resolved first parses the records into an ordered set:

    Now think about what real end user devices are using: Windows/macOS/iOS obviously aren't using glibc and Android also has its own C library even though it's Linux-based, and they all probably fall under the "Most DNS clients don't have this issue.".

    That leaves GNU/Linux, where we could reasonably expect most software to use glibc for resolving queries, so presumably anyone using Linux on their laptop would catch this, right? Except most distributions started using systemd-resolved (most notable exception is Debian, but not many people use that on desktops/laptops), which is a local caching stub resolver, and as such acts as a middleman between glibc software and the network-configured DNS server, so it would resolve 1.1.1.1 queries correctly, and then return the results from its cache ordered by its own ordering algorithm.

    • > other than unit tests (which obviously wouldn't catch this unless they were explicitly written for this case)

      They absolutely should have unit tests that detect any change in output and manually review those changes for an operation of this size.

  • > I assume some tests at least broke that meant they needed to be "fixed up"

    OP said:

    "However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC."

    One could guess it's something like -- back when we wrote the tests, years ago, whoever did it missed that this was required, not helped by the fact that the spec preceded RFC 2119 standardizing the all-caps "MUST" "SHOULD" etc language, which would have helped us translate specs to tests more completely.

    • You'd think that something this widely used would have golden tests that detect any output change to trigger manual review but apparently they don't.

  • > Ends up in test environment for, what, a month.. nothing using getaddrinfo from glibc is being used to test this environment or anyone noticed that it was broken

    This is the part that is shocking to me. How is getaddrinfo not called in any unit or system tests?

    • As black3r mentioned (https://news.ycombinator.com/item?id=46686096), it is likely rearranged by systemd, therefore only non-systemd glibc distributions are affected.

      I would hazard a guess that their test environment has both the systemd variant and the Unbound variant (Unbound technically does not rearrange them, but instead reconstructs the answer according to the RFC "CNAME restart" logic because it is a recursive resolver in itself), but not just a plain directly-piped resolv.conf (presumably because who would run that in this day and age; this is sadly just a half-joke, because only a few people would fall into this category).
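The golden-test idea raised in the replies above can be sketched roughly as follows. This is a minimal illustration in Python, not Cloudflare's test suite: `render_answer`, `matches_golden`, and the `GOLDEN` snapshot are hypothetical names, and the records use documentation-reserved names and addresses. The point is only that an exact-text comparison flags a reordered answer section even when the set of records is unchanged.

```python
from typing import List, Tuple

# (owner name, record type, rdata); a deliberately simplified record model.
Record = Tuple[str, str, str]

# Hypothetical stored snapshot of the expected answer section.
GOLDEN = (
    "www.example.com CNAME cdn.example.net\n"
    "cdn.example.net A 192.0.2.1"
)


def render_answer(records: List[Record]) -> str:
    """Render an answer section as text, one record per line, preserving order."""
    return "\n".join(f"{owner} {rtype} {rdata}" for owner, rtype, rdata in records)


def matches_golden(records: List[Record], golden: str = GOLDEN) -> bool:
    """Return True only if the rendered output is byte-for-byte the snapshot."""
    return render_answer(records) == golden


original = [
    ("www.example.com", "CNAME", "cdn.example.net"),
    ("cdn.example.net", "A", "192.0.2.1"),
]
# Same records, but the A record now precedes the CNAME.
reordered = list(reversed(original))
```

A test that compared the records as an unordered set would pass for both lists; the snapshot comparison fails for `reordered`, which is the kind of change that would then go to manual review.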

The article makes it very clear that the ambiguity arises in another phrase: “difference in ordering of the RRs in the answer section is not significant”, which is applied to an example; the problem with examples being that they are illustrative, viz. generalisable, and thus may permit reordering everywhere, and in any case, whether they should or shouldn’t becomes a matter of pragmatic context.

Which goes to show, one person’s “obvious understanding” is another’s “did they even read the entire document”.

All of which also serves to highlight the value of normative language, but that came later.

  • it wouldn't be a problem if they tested it properly... especially WHEN stuff is ambiguous

    • They may not have realized their interpretation is ambiguous until after the incident, that’s the kind of stuff you realize after you find a bug and do a deep dive in the literature for a post mortem. They probably worked with the certitude that record order is irrelevant until that point.

I agree with you, and I also think that their interpretation of example 6.2.1 in the RFC is somewhat nonsensical. It states that “The difference in ordering of the RRs in the answer section is not significant.” But from the RFC, very clearly this comment is relevant only to that particular example; it is comparing two responses and saying that in this case, the different ordering has no semantic effect.

And perhaps this is somewhat pedantic, but they also write that “RFC 1034 section 3.6 defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class.” But looking at the RFC, it never defines such a term; it does say that within a “set” of RRs “associated with a particular name” the order doesn’t matter. But even if the RFC had said “associated with a particular combination of name, type, and class”, I don’t see how that could have introduced ambiguity. It specifies an exception to a general rule, so obviously if the exception doesn’t apply, then the general rule must be followed.

Anyway, Cloudflare probably know their DNS better than I do, but I did not find the article especially persuasive; I think the ambiguity is actually just a misreading, and that the RFC does require a particular ordering of CNAME records.

(ETA:) Although admittedly, while the RFC does say that CNAMEs must come before As in the answer, I don’t necessarily see any clear rule about how CNAME chains must be ordered; the RFC just says “Domain names in RRs which point at another name should always point at the primary name and not the alias ... Of course, by the robustness principle, domain software should not fail when presented with CNAME chains or loops; CNAME chains should be followed”. So actually I guess I do agree that there is some ambiguity about the responses containing CNAME chains.

Isn't this literally noted in the article? The article even points out that the RFC is from before normative words were standardized for hard requirements.

Even if 'possibly preface' is interpreted to mean CNAME RRsets should appear first, there is still a broken reliance by some resolvers on the order of CNAME RRsets if there is more than one CNAME in the chain. This expectation of ordering is not promised by the relevant RFCs.

100%

I just commented the same.

It's pretty clear that the "possibly" refers to the presence of the CNAME RRs, not the ordering.

  • The context makes it less clear, but even if we pretend that part is crystal, a comment that stops there is missing the point of the article. All CNAMEs at the start isn't enough. The order of the CNAMEs can cause problems despite perfect RFC compliance.
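To make the chain-order point concrete, here is a simplified Python model of the two parsing styles the thread contrasts. It is a sketch under stated assumptions, not glibc's or systemd-resolved's actual code: `resolve_single_pass` and `resolve_order_independent` are hypothetical names, records are reduced to (owner, type, rdata) tuples, and the names and addresses are documentation-reserved examples.

```python
from typing import List, Tuple

# (owner name, record type, rdata); a deliberately simplified record model.
Record = Tuple[str, str, str]


def resolve_single_pass(qname: str, answers: List[Record]) -> List[str]:
    """Order-dependent: one pass over the answer section, tracking the
    name currently expected. Records for any other name are skipped."""
    expected = qname
    addrs = []
    for owner, rtype, rdata in answers:
        if owner != expected:
            continue
        if rtype == "CNAME":
            expected = rdata  # follow the alias for the records that come after
        elif rtype == "A":
            addrs.append(rdata)
    return addrs


def resolve_order_independent(qname: str, answers: List[Record]) -> List[str]:
    """Order-independent: index the CNAMEs first, follow the chain,
    then collect A records for the final name."""
    cnames = {owner: rdata for owner, rtype, rdata in answers if rtype == "CNAME"}
    name, seen = qname, set()
    while name in cnames and name not in seen:  # 'seen' guards against loops
        seen.add(name)
        name = cnames[name]
    return [rdata for owner, rtype, rdata in answers
            if rtype == "A" and owner == name]


in_order = [
    ("www.example.com", "CNAME", "cdn.example.net"),
    ("cdn.example.net", "CNAME", "edge.example.org"),
    ("edge.example.org", "A", "192.0.2.1"),
]
# All CNAMEs still come before the A record, but the chain is out of order.
chain_swapped = [in_order[1], in_order[0], in_order[2]]
```

With `chain_swapped`, every CNAME still precedes the A record, yet the single-pass function returns an empty list because it meets the second link of the chain before the first. The order-independent function returns the address for both inputs.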