← Back to context

Comment by ereyes01

9 years ago

A few past lives ago, I used to work on the AIX kernel at IBM. I once spent a few weeks poring through trace data trying to investigate a very mysterious cache-aligned memory corruption induced by a memory stress test. Our trace data was quite comprehensive, and is always turned on due to its very low overhead. It was concerning enough (and took me long enough) that it eventually sucked in the rest of my team to aid in the investigation. None of these other guys were noobs- a couple of them had (at the time) built over 20 years of experience in this system, and in diagnosing similar memory corruption bugs beyond any doubt (many were due to errant DMAs from device drivers). I had too, though for much less than 20 years.

After several full days of team-wide debugging, we had no better explanation based on the available evidence than cosmic rays, or a hardware bug. IBM's POWER processor designers worked across the street from us, so we tried to get them to help- first by asking nicely, then by escalating through management channels.

Their reply was more or less: we've run our gamut of hardware tests for years, and your assertion that it's hardware related is vanishingly unlikely... we don't look into hardware bugs unless you can prove to us beyond a doubt it's hardware related. Cache-aligned memory corruption without any other circumstantial evidence isn't enough.

On a crashed test system sitting in the kernel debugger for several weeks now, there would be no more circumstantial evidence beyond the traces. A corruption like this was never seen again, by all accounts.

If we were right and it was evidence of a hardware failure, this is one way such a problem can go undetected. I hope it was something else, or even a cosmic ray, but we'll never know for sure, I guess.

I understand that someone at Microsoft Research once found a bug in the XB360's memory subsystem by model checking a TLA+ spec of it. The story goes that IBM initially refused to believe the bug report. A few weeks later they admitted that such a bug did indeed exist, had been missed by all their testing and would have resulted in system crashes after about 4 hours use.

Did you by chance see the paper a year back or so outlining that memory errors are more likely to occur near page boundaries? The author's premise was that a lot of 'cosmic rays' are just manufacturing flaws.

  • I debugged lots of memory corruption errors in my time working on AIX kernel stuff. Most of the ones I got to work on did indeed happen at a page boundary [1]. I think there is a pretty simple reason for this: "pages" are both the logical unit size of memory the kernel's allocator hands out, and the unit size that the hardware is capable of addressing in most cases. Therefore, when something at the kernel level is done incorrectly, it often references someone else's page.

    It's also possible for a _byte_ offset to be wrong, and these types of errors need not occur at a page boundary. Useful things to do with a raw memory dump with this kind of corruption (at least in AIX kernel):

    - Identify the page containing the corruption, and find all activity concerning the physical frame in the traces.

    - Try to reverse engineer the bad data. Often times, there are pointers you can follow. You would have to manually translate the virtual address to physical frames, but that's pretty simple to do, both for user space and kernel space (in our case, it was always in kernel space, which was 64-bit and just used a simple prefix).

    From there, you just have to be crafty and thorough in following the breadcrumbs and either identifying the bug in code, or at least who should investigate next.

    In my original post, note that the corruption was on a _CPU cache_ boundary (128 bytes in POWER). IIRC, the containing page was allocated and pinned [2] for a longer time than the trace buffer tracked (it's been a few years, though).

    [1] To make things fun, AIX and POWER supports multiple page sizes- 4K, 64K, 16MB, 16GB. Hardware also has the ability of promoting / demoting between 4K and 64K for in-use memory... lots of fun :-)

    [2] AIX is a pageable kernel. If kernel code can't handle a page fault, it needs to "pin" the memory.

    ... note that any of this can be really outdated, it's been almost a decade since I was an expert in this stuff :-)

    (Edit: formatting... how do you make a pretty bulleted list on HN??)

    • Pretty cool to read these stories about AIX. It's always been a nice system paired with HW often a couple years ahead of the performance curve, but it certainly became more and more of a sustaining effort after 2001.

  • Shouldn't ECC offer a second form of benchmark on this? If you see transient, cosmic-ray looking errors in ECC, presumably that's much stronger evidence of a hardware bug. Of course, I've also heard it claimed that ECC design and manufacture are held to a higher standard that might hide the issue.

  • having often attributed these things to cosmic rays as well, anyone have any actual estimates on how often/likely cosmic rays are to cause errors in desktop-like setups?

    had to have been a cosmic ray at least once right?