Comment by ploxiln
9 years ago
That's absolutely true. When it comes to CPU/memory, skilled software engineers always think, "it must be my bug, it always is".
So in that super rare case of actually running into a CPU defect, it's a mindfuck, it'll drive you crazy. You'll be looking for the flaw in your algorithm which makes it fail once a week under production load. But you just can't find it, it makes no sense ...
(When it comes to drivers for network/storage/graphics etc devices, it's a whole different story. Those things are piles of bugs that need work-arounds in drivers.)
A little anecdote describing one such bug. I didn't find this, another one of my teammates did.
The symptom was that a board with a specific microcontroller on it would be working fine, then after a power cycle it might not keep working. A flash dump would show that the reset vector, the first byte of flash on that system, would be all zeroed out. Of course the system would not run anymore, but why did it happen? After months of intermittent debugging and trying to reproduce the cause was determined. At least under certain conditions the brownout detection level was lower than the voltage level that caused the CPU to make errors. If the board lost power slowly then the CPU would start executing corrupted / arbitrary instructions which generally included lots of zeros. It would occasionally write zeros to the zero address, bricking the board.
Since then we have external power monitoring and reset circuits on all the new boards, but existing ones needed a fix. Luckily the board had power failure interrupt connected, so when that triggers we reconfigure the CPU to execute on the slowest possible clock rate, which greatly reduced the occurrence.
Yes, we have seen similar in AVRs (328p) where the lowest voltage brownout threshold/trip is not enough to prevent the EEPROM and/or Flash from getting corrupted somehow...
Similarly: Kernel bugs! Once upon a time I spend a good week trying to reproduce a rare crash one of our users saw ever so often (just enough to get our attention, but not enough to get a repro case). Stack traces made no sense and just in general the whole crash made no sense the more I looked at it. Turns out it was a bug in XNUs pthread subsystem, a part where I would have never looked if it weren‘t for desperation, because, well, it‘s the kernel, it works, right?
Emacs would regularly panic the kernel on one OS X version when I opened preview to look at a pdf and had TeX recompile.
The DVD player on OSX with a MacBook (polycarbonate ones) would occasionally cause a kernel panic if you had 4GB of RAM installed.
Or at least should have been long since spotted thanks to use and abuse.
You have some lovely stories online about these things. Like the guy tracking down a stuck bit in RAM.
Or on the network level, a VPN that failed only when traversing one possible route between company offices.
I have a very old one. I was working on a embedded system using an Intel 80186. The ‘186 had a memory mapped IO system that was very unusual for Intel x86 CPUs. The code that I started with was originally written by a bunch of Motorola 68000 guys so they used this mode. I had to modify the interrupt structure so I rewrote the interrupt service routines. Since it was memory mapped and ANDs were documented to be atomic, I assumed that I could just AND bits into the interrupt mask register. Big mistake. It turned out that ANDs to the IO registers were not atomic. And could be interrupted in the middle of the write after read. I took me about a month to realize that this was a hardware “bug” and not in my interrupt code. Drove me nuts.
These problems have existed for years - I once tracked down a dead cell in core memory on a system I worked on in the Air Force. Because of the chance alignment, it caused incoming messages to never "end". Which was fun for the operators.
http://www.1882nd.com/images/DSTE.jpg