Comment by jrochkind1
18 days ago
> I assume some tests at least broke that meant they needed to be "fixed up"
OP said:
"However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC."
One could guess it's something like -- back when we wrote the tests, years ago, whoever did it missed that this was required, not helped by the fact that the spec preceded RFC 2119 standardizing the all-caps "MUST" "SHOULD" etc language, which would have helped us translate specs to tests more completely.
You'd think that something this widely used would have golden tests that detect any output change to trigger manual review but apparently they don't.
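For what I mean by a golden test, here is a minimal sketch in Python. Everything in it is made up for illustration: `serialize_answer` is a hypothetical stand-in for whatever renders a response, and the records and `GOLDEN` text are invented. The point is only that the comparison is exact, so a reordering fails even when the set of records is unchanged.

```python
def serialize_answer(records):
    """Hypothetical stand-in for the code under test: renders records in the order given."""
    return "\n".join(
        f"{name} {ttl} {rtype} {value}" for name, ttl, rtype, value in records
    )


# Recorded output from a known-good run (invented here for illustration).
GOLDEN = (
    "www.example.com 300 CNAME cdn.example.net\n"
    "cdn.example.net 60 A 192.0.2.1"
)


def matches_golden(actual, golden=GOLDEN):
    """True only when the output equals the recorded golden text exactly, order included."""
    return actual == golden


records = [
    ("www.example.com", 300, "CNAME", "cdn.example.net"),
    ("cdn.example.net", 60, "A", "192.0.2.1"),
]

same_order = matches_golden(serialize_answer(records))
reordered = matches_golden(serialize_answer(list(reversed(records))))
```

A failure here doesn't mean the change is wrong, just that a human has to look at the diff and decide whether to update the golden file.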
Oh, they explain, if I understand right, they did the output change intentionally, for performance reasons. Based on the inaccurate assumption that order did not matter in DNS responses -- because there are OTHER aspects of DNS responses in which, by spec, order does not matter, and because there were no tests saying order mattered for this component.
> "The order of RRs in a set is not significant, and need not be preserved by name servers, resolvers, or other parts of the DNS." [from RFC]
> However, RFC 1034 doesn’t clearly specify how message sections relate to RRsets.
The developer(s) were assuming order didn't matter in general, because the RFC said it didn't for one aspect, and intentionally made a change to order for performance reasons. But it turned out that change did matter.
Mistakes of this kind seem unavoidable; this one doesn't necessarily say to me the developers made a mistake I never could have made or something.
I think the real conclusion is they probably need tests using actual live network stacks with common components, and why didn't they have those? Not just unit tests or tests with mocks, but tests that would have actually used the real getaddrinfo function in glibc and shown it failing?
Even if there weren't tests for the return order, I would have bet that there were tests of backbone resolvers like getaddrinfo. Is it really possible that the first time anyone noticed that it crashed, or that Ciscos bootlooped, was on a live query?
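As a rough sketch of the kind of test I have in mind, assuming a test host whose system resolver is already pointed at the DNS server under test (that setup is not shown): Python's `socket.getaddrinfo` goes through the platform's own getaddrinfo, so on a glibc box this exercises the real client code rather than a mock. `resolve_via_libc` and `check_resolution` are names I made up.

```python
import socket


def resolve_via_libc(hostname):
    """Resolve through the platform's getaddrinfo; return the unique addresses, sorted."""
    infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def check_resolution(hostname, expected_addresses):
    """Hypothetical end-to-end check: (True, "ok") on success, else (False, reason)."""
    try:
        actual = resolve_via_libc(hostname)
    except socket.gaierror as exc:
        # The real client library rejected or failed to parse the response.
        return False, f"getaddrinfo failed: {exc}"
    if actual != sorted(expected_addresses):
        return False, f"expected {sorted(expected_addresses)}, got {actual}"
    return True, "ok"
```

You would run something like `check_resolution(name, addresses)` against a set of names with known answers on every build, ideally on several client platforms, since what breaks glibc may not break another stack and vice versa.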