Comment by jackalope
13 years ago
I dunno. Something about this analysis bothers me. The basic premise is that a string the length of a domain name occasionally gets corrupted by a single flipped bit, causing errant DNS lookups. If that's so, a longer string should be even more likely to contain corruption. So why do I almost never see any evidence of bit corruption in my web server logs? The same corruption should affect other parts of the URL, and with greater probability given their length, yet I can't find a single example in my logs that can't be explained by human error (typos by users or developers). If bit corruption shows up this readily in hostnames but not in URLs or other identifiers, I suspect a software bug somewhere.
Presumably your web server doesn't serve anywhere near as much traffic as fbcdn.net. The odds of any one bit flip are vanishingly low, so you need an enormous amount of traffic before such errors become visible at all.
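Back-of-the-envelope sketch (every number below is a made-up order-of-magnitude assumption, since published DRAM soft-error rates vary wildly): a tiny per-bit rate turns into a steady trickle once you multiply by hundreds of millions of clients, each holding a copy of the hostname in RAM.

    # All constants are illustrative assumptions, not measurements.
    FLIPS_PER_BIT_PER_HOUR = 1e-13        # hypothetical DRAM soft-error rate
    HOSTNAME_BITS = len("fbcdn.net") * 8  # 72 bits that matter for the lookup
    HOURS_RESIDENT = 1.0                  # assume the string sits in RAM ~1 hour
    CLIENTS = 1e9                         # rough order of magnitude of audience

    flips_per_hour = FLIPS_PER_BIT_PER_HOUR * HOSTNAME_BITS * HOURS_RESIDENT * CLIENTS
    print(f"expected corrupted copies per hour: {flips_per_hour:.4f}")
    print(f"expected corrupted copies per year: {flips_per_hour * 24 * 365:.0f}")

Any individual machine essentially never sees it, but the aggregate still produces a nonzero stream of lookups for single-bit-flipped domains.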
In my understanding of network communication and data transmission, this should be impossible. All payload data, encapsulated header data, etc. are subject to checksums, hashes, variable encoding schemes on the wire, parity balancing, redundant bit insertion (Hamming codes), and so on, the result of which will always signal an error's presence. Even if the bit flip occurs in primary memory, surely the OS's memory-management subsystems would detect the corruption.
So a bit flip going undetected and unremedied all the way to an errant DNS lookup seems odd. Although I could be wrong (I'm just a final-year CS student).
EDIT: Just watched the video (I'd originally written it off as TL;DW); it seems plausible.
Note that no error-detection code is able to detect all errors; it just lowers the probability of an error passing undetected even further. (CRCs are pretty robust in that they detect all error bursts up to a length determined by the polynomial, e.g. any burst of <= 32 bits for a 32-bit CRC.) With a large enough sample size, you will hit errors.
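Also worth remembering: the Ethernet CRC only covers each individual hop and is regenerated along the way, so end-to-end you're often left with just the 16-bit Internet checksum in IP/TCP/UDP, and that one can't catch two offsetting flips in the same bit position of different 16-bit words. Quick sketch (the fbcdn.net payload is just for flavour):

    def inet_checksum(data: bytes) -> int:
        """RFC 1071 Internet checksum: one's-complement sum of 16-bit words."""
        if len(data) % 2:
            data += b"\x00"
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
        return ~total & 0xFFFF

    original = b"fbcdn.net"
    corrupted = bytearray(original)
    corrupted[1] ^= 0x02  # 'b' -> '`': clear bit 1 in one 16-bit word
    corrupted[3] ^= 0x02  # 'd' -> 'f': set the same bit in the next word

    assert bytes(corrupted) != original
    assert inet_checksum(bytes(corrupted)) == inet_checksum(original)  # undetected

A CRC would catch this particular pattern, but the point stands: whatever end-to-end check survives the path has blind spots, and at fbcdn.net scale the blind spots get exercised.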
In this case it is probably memory corruption. The OS won't be able to detect that unless the machine has ECC memory (relatively uncommon these days). It could theoretically be caught if memory pages were checksummed and periodically verified against the checksum, but AFAIK no OS does that.
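The nasty part is that a flipped bit in a hostname string usually yields another syntactically plausible hostname, so nothing downstream looks wrong; the resolver just dutifully looks up a different name. A quick sketch of the one-bit neighbours of fbcdn.net (purely illustrative; whether any of them are actually registered is a separate question):

    import string

    HOST = "fbcdn.net"
    # Lowercase only: flipping bit 5 of a letter just changes case, and DNS is
    # case-insensitive, so those flips still resolve to the same domain.
    ALLOWED = set(string.ascii_lowercase + string.digits + "-.")

    def one_bit_flips(name: str):
        """Yield every string differing from `name` by exactly one flipped bit."""
        raw = bytearray(name.encode("ascii"))
        for i in range(len(raw)):
            for bit in range(8):
                mutated = bytearray(raw)
                mutated[i] ^= 1 << bit
                yield mutated.decode("ascii", errors="replace")

    neighbours = sorted(v for v in one_bit_flips(HOST) if all(c in ALLOWED for c in v))
    print(len(neighbours), "one-bit neighbours still look like hostnames, e.g.")
    print(neighbours[:8])

Which is why the errors show up in logs as plausible-looking DNS queries rather than as anything a log parser would flag as corruption.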
On consumer hardware, there is typically no mechanism enabled to detect errors in RAM.
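For contrast, ECC DIMMs add a SECDED code (Hamming-style check bits plus an extra parity bit) across each 64-bit word, so a single flipped bit is located and silently corrected in hardware. A toy Hamming(7,4) sketch of the idea (real DIMMs use a 72,64-bit code and the details differ):

    def hamming74_encode(d1, d2, d3, d4):
        """Encode 4 data bits into a 7-bit Hamming(7,4) codeword (positions 1..7)."""
        p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_correct(c):
        """Fix at most one flipped bit and return the 4 data bits."""
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit, 0 if clean
        if syndrome:
            c[syndrome - 1] ^= 1
        return [c[2], c[4], c[5], c[6]]

    word = [1, 0, 1, 1]
    code = hamming74_encode(*word)
    code[5] ^= 1                          # simulate a cosmic-ray flip in "RAM"
    assert hamming74_correct(code) == word

Without check bits like these, the flipped bit simply becomes the new "truth", which is exactly the scenario this thread is about.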