Comment by jldugger

9 years ago

Did you by chance see the paper a year back or so outlining that memory errors are more likely to occur near page boundaries? The author's premise was that a lot of 'cosmic rays' are just manufacturing flaws.

I debugged lots of memory corruption errors in my time working on AIX kernel stuff. Most of the ones I got to work on did indeed happen at a page boundary [1]. I think there is a pretty simple reason for this: "pages" are both the logical unit size of memory the kernel's allocator hands out, and the unit size that the hardware is capable of addressing in most cases. Therefore, when something at the kernel level is done incorrectly, it often references someone else's page.

It's also possible for a _byte_ offset to be wrong, and these types of errors need not occur at a page boundary. Useful things to do with a raw memory dump with this kind of corruption (at least in AIX kernel):

- Identify the page containing the corruption, and find all activity concerning the physical frame in the traces.

- Try to reverse engineer the bad data. Often times, there are pointers you can follow. You would have to manually translate the virtual address to physical frames, but that's pretty simple to do, both for user space and kernel space (in our case, it was always in kernel space, which was 64-bit and just used a simple prefix).

From there, you just have to be crafty and thorough in following the breadcrumbs and either identifying the bug in code, or at least who should investigate next.

In my original post, note that the corruption was on a _CPU cache_ boundary (128 bytes in POWER). IIRC, the containing page was allocated and pinned [2] for a longer time than the trace buffer tracked (it's been a few years, though).

[1] To make things fun, AIX and POWER supports multiple page sizes- 4K, 64K, 16MB, 16GB. Hardware also has the ability of promoting / demoting between 4K and 64K for in-use memory... lots of fun :-)

[2] AIX is a pageable kernel. If kernel code can't handle a page fault, it needs to "pin" the memory.

... note that any of this can be really outdated, it's been almost a decade since I was an expert in this stuff :-)

(Edit: formatting... how do you make a pretty bulleted list on HN??)

  • Pretty cool to read these stories about AIX. It's always been a nice system paired with HW often a couple years ahead of the performance curve, but it certainly became more and more of a sustaining effort after 2001.

Shouldn't ECC offer a second form of benchmark on this? If you see transient, cosmic-ray looking errors in ECC, presumably that's much stronger evidence of a hardware bug. Of course, I've also heard it claimed that ECC design and manufacture are held to a higher standard that might hide the issue.

having often attributed these things to cosmic rays as well, anyone have any actual estimates on how often/likely cosmic rays are to cause errors in desktop-like setups?

had to have been a cosmic ray at least once right?