Comment by AnimalMuppet
5 years ago
I've posted this story before, but it fits here rather nicely.
I had a function that looked like this:
    void f() {
        bool flag = true;
        while (flag) {
            g();
        }
    }
This function would sometimes exit. But that's really all there was to the function. Somehow flag was becoming false, even though nothing ever wrote to it.
When a variable changes mysteriously, you might suspect g() of smashing the stack. But then you'd expect the return address to get overwritten too, and it wasn't: the function returned from g() to f(), found flag to be false, exited the loop, and returned from f().
Eventually I got desperate enough to look at the assembly code produced by the compiler, and I became enlightened. (This was g++ on an ARM, by the way.) flag was being stored in R11, not in memory. (Might have been R12 - it's been a while.) When g() was called, f() just pushed the return address. Then g() pushed R11, because it was going to have its own variable to stash there, and then created space for its stack variables. And one of those variables was smashing the stack by 4 bytes, over-writing the saved flag value from f().
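To make the mechanism concrete, here is a toy model of it. This is a sketch, not real ARM code and not the original program: ToyCpu, kLocalWords, kReturnAddr, and flag_after_call are made-up names, and the "stack" is just a small array of words. It only illustrates how a callee that saves a register, overruns a local by one word, and then restores the register can change the caller's variable while leaving the return address alone.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy machine, purely illustrative: a small word-addressed stack that grows
// downward and a single callee-saved register standing in for R11.
struct ToyCpu {
    std::array<std::uint32_t, 16> mem{};
    std::size_t sp = 16;      // one past the top; push pre-decrements
    std::uint32_t r11 = 0;

    void push(std::uint32_t v) { mem[--sp] = v; }
    std::uint32_t pop() { return mem[sp++]; }
};

constexpr std::size_t kLocalWords = 4;          // size of g()'s local buffer
constexpr std::uint32_t kReturnAddr = 0x8000;   // made-up return address

// Model of g(): save r11, reserve locals, write `words_written` words into
// the local buffer, then restore r11. Writing more than kLocalWords words
// runs upward into the saved r11 slot.
void g(ToyCpu& cpu, std::size_t words_written, std::uint32_t fill) {
    cpu.push(cpu.r11);        // prologue: save the caller's r11
    cpu.sp -= kLocalWords;    // reserve space for locals
    cpu.r11 = 0xABCD;         // g() reuses r11 for its own variable
    for (std::size_t i = 0; i < words_written; ++i) {
        if (cpu.sp + i >= cpu.mem.size()) break;  // keep the model in bounds
        cpu.mem[cpu.sp + i] = fill;
    }
    cpu.sp += kLocalWords;    // release locals
    cpu.r11 = cpu.pop();      // epilogue: restore (possibly corrupted) r11
}

// Model of one iteration of f()'s loop. Returns whether `flag` is still true
// after g() returns; `return_ok` reports whether the return address survived.
bool flag_after_call(std::size_t words_written, std::uint32_t fill,
                     bool* return_ok = nullptr) {
    ToyCpu cpu;
    cpu.push(kReturnAddr);    // return address sits above the saved r11
    cpu.r11 = 1;              // flag = true, held only in the register
    g(cpu, words_written, fill);
    if (return_ok) *return_ok = (cpu.mem[cpu.sp] == kReturnAddr);
    return cpu.r11 != 0;
}
```

In this model, flag_after_call(4, 0) leaves the flag set, flag_after_call(5, 0) clears it, and flag_after_call(5, 7) overruns the buffer yet leaves the flag looking true, which is why a bug like this can show up only occasionally.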
Worse, the stack was getting smashed on a call to mesgrecv(). This takes a pointer to a structure and a size, but the relationship between the two isn't what you'd expect: the size isn't the size of the structure, but rather the size of a substructure within that structure. A contractor had gotten that detail wrong when they used that mechanism for IPC between two chips. (They'd gotten it wrong on the sending side, too, so the data stayed in sync.)
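The size mix-up can be sketched in isolation. Everything here is an assumption for illustration: the Message layout (a 4-byte header plus a 16-byte payload), toy_recv as a stand-in for a receive call whose size argument means "payload size", and a 4-byte guard placed right after the structure to play the part of whatever lives next to it in memory. It is not the real mesgrecv() or the original message format.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical message layout: header followed by the payload.
struct Message {
    std::uint32_t type;   // header
    char payload[16];     // the "substructure" the size argument describes
};

constexpr std::size_t kGuardBytes = 4;
constexpr std::size_t kWireBytes = sizeof(Message) + kGuardBytes;
using Wire = std::array<unsigned char, kWireBytes>;

// Hypothetical receive: copies the header plus `payload_size` bytes, so the
// caller must pass the payload size, not sizeof(Message).
std::size_t toy_recv(unsigned char* dst, std::size_t payload_size,
                     const Wire& wire) {
    std::size_t total = sizeof(std::uint32_t) + payload_size;
    if (total > wire.size()) total = wire.size();  // keep the model in bounds
    std::memcpy(dst, wire.data(), total);
    return total;
}

// Receives into a buffer holding a Message followed by a 4-byte guard that
// is preset to 1, and returns the guard's value afterwards.
std::uint32_t guard_after_recv(std::size_t size_arg, const Wire& wire) {
    std::array<unsigned char, kWireBytes> frame{};
    const std::uint32_t one = 1;
    std::memcpy(frame.data() + sizeof(Message), &one, sizeof one);
    toy_recv(frame.data(), size_arg, wire);
    std::uint32_t guard = 0;
    std::memcpy(&guard, frame.data() + sizeof(Message), sizeof guard);
    return guard;
}
```

Passing sizeof(Message::payload) leaves the guard untouched. Passing sizeof(Message) writes four extra bytes, so the guard ends up holding whatever followed the message on the sending side.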
The net result was that the flag got cleared when four next-door-but-unrelated bytes on another CPU were all zero. It took me a month, off and on, to figure that out.
Crazy thing to go with that... if you had compiled with different (more aggressive) optimization flags, it might have gone away!
It already went away when I tried to print out the address of the variable, so that I could watch it in the debugger (because, in order to take the address of it, it had to become a stack variable).
A real life Heisenbug.
https://en.wikipedia.org/wiki/Heisenbug
In the end, do you remember what tools you used to confirm that R11 was overwritten? The tools and the path to the root cause are also quite interesting.