← Back to context

Comment by thegrim33

3 months ago

A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?

The simplest way to do this, what I believe memtest86 and friends do, is to write a fixed pattern over a region of memory and then read it back later and see if it changed; then you write patterns that require flipping the bits that you wrote before, and so on.

Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].

There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.

edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.

[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...

[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568

[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...

[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...

  • That would tell you if there's a bitflip in your test, but not if there's a bitflip in normal program code causing a crash, no? IIUC GP's questions was how do they actually tell after a crash that that crash was caused by a bitflip.

    • The example I gave in there is of adding sentinel values in your data, so you can check the constants in your data structures later and go "oh, this is overwritten with garbage" versus "oh, this is one or two bits off". I would imagine plumbing things like that through most common structures is what was done there, though I haven't done the archaeology to find out, because Firefox is an enormous codebase to try and find one person's commits from several years ago in.

      4 replies →

> last year we deployed an actual memory tester that runs on user machines after the browser crashes.

He doesn't explain anything indeed but presumably that code is available somewhere.

  • That, and 50% of the machines where their heuristics say it is a hardware error fail basic memory tests.

    I've seen a lot of confirmed bitflips with ECC systems. The vast majority of machines that are impacted are impacted by single event upsets (not reproducible).

    (I worded that precisely but strangely because if one machine has a reproducible problem, it might hit it a billion times a second. That means you can't count by "number of corruptions".)

    My take is that their 10% estimate is a lower bound.

It sounds like they don't know that the crashes are from bitflips but those crashes are from people with flaky memory which probably caused the crash?

A common case is a pointer that points to unallocated address space triggers a segfault and when you look at the pointer you can see that it's valid except for one bit.

  • That tells you one bit was changed. It doesn't prove that single bit changed due to a hardware failure. It could have been changed by broken software.

    • [I work at Mozilla]

      Yes, that's a confounding factor, and in fact the starting assumption when looking at a crash. Sometimes you can be pretty sure it's hardware. For example, if it's a crash on an illegal instruction in non-JITted code, the crash reporter can compare that page of data with the on-disk image that it's supposed to be a read-only copy of. Any mismatches there, especially if they're single bit flips, are much more likely to be hardware.

      But I've also seen it several times when the person experiencing the crashes engages on the bug tracker. Often, they'll get weird sporadic but fairly frequent crashes when doing a particular activity, and so they'll initially be absolutely convinced that we have a bug there. But other people aren't reporting the same thing. They'll post a bunch of their crash reports, and when we look at them, they're kind of all over the place (though as they say, almost always while doing some particular thing). Often it'll be something like a crash in the garbage collector while watching a youtube video, and the crashes are mostly the same but scattered in their exact location in the code. That's a good signal to start suspecting bad memory: the GC scans lots of memory and does stuff that is conditional on possibly faulty data. We'll start asking them to run a memory test, at least to rule out hardware problems. When people do it in this situation, it almost always finds a problem. (Many people won't do it, because it's a pain and they're understandably skeptical that we might be sandbagging them and ducking responsibility for a bug. So we don't start proposing it until things start feeling fishy.)

      But anyway, that's just anecdata from individual investigations. gsvelto's post is about what he can see at scale.

    • Broken software causes null pointer references and similar logic errors. It would be extremely unusual to have an inadvertent

          ptr ^= (1 << rand_between(0,64));
      

      that got inserted in the code by accident. That's just not the way that we write software.

      4 replies →

[flagged]

  • I don't think Firefox has the access permissions needed to read MCE status, and the vast majority of our users don't have ECC, let alone they're going to run memtest86(+) after a Firefox crash.

    If they did, we wouldn't be having this discussion to begin with!