Marcan's attitude is great; I know of a ton of people (myself included) who would've written that article with far more complaining interleaved. Super informative as well, I learned a ton from this article (GRUB 2 feature for marking off bad RAM? Wow!). Very well written, informative, humorous, etc. Love it
Agreed. Marcan's a chill dude. I saw him a few times in the Dolphin Emulator IRC and he always had interesting things to say.
From my perspective this approach was pretty unique; going all the way down to debugging the hardware first may seem obvious to some, but it's a totally opposite approach to how I'd go about it. My mind would jump directly to producing a minimal test case. Would've never thought to mark off bad RAM with an obscure(ish?) GRUB 2 feature. Would've never thought to selectively flip a kernel flag for some parts of the code.
It's great to get these perspectives from people who really know how to dig down and debug deep.
From a slightly more old-school sysadmin approach, I learned to troubleshoot roughly in line with the OSI model (https://www.lifewire.com/layers-of-the-osi-model-illustrated...), starting at layer 1 (physical) and working up.
That's not to say I spend a whole lot of time looking at the lower levels, but my quick mental checklist starts off down at physical points, and I try to quickly eliminate possibilities. In a lot of cases it's obvious it's a code / logic bug, and you can completely skip the lower layer stuff, but making it a conscious step pays off.
That is such a well-written post, and it showcases some serious diagnostic skills by an experienced person.
Btw, thanks for telling me about the badram GRUB 2 feature, had no idea that existed.
Man I loved the approach to narrow down the offending object file using the regex on the SHA hash of the binaries. That would've saved me lots of time hunting and guessing bugs with cscope+kdb back in my kernel hacker days!
Wow, these are some serious debugging skills. I also admire the tenacity and the will to investigate the root cause.
> I tried setting GOMAXPROCS=1, which tells Go to only use a single OS-level thread to run Go code. This also stopped the crashes, again pointing strongly to a concurrency issue.
I think I would have stopped there.
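For reference, the same limit can be set from inside a program instead of through the GOMAXPROCS environment variable. A minimal sketch (the helper name limitToOneThread is made up for this example):

```go
package main

import (
	"fmt"
	"runtime"
)

// limitToOneThread restricts the runtime to one OS thread executing Go code
// at a time and reports the previous and the new setting.
func limitToOneThread() (prev, now int) {
	prev = runtime.GOMAXPROCS(1) // sets the limit, returns the old value
	now = runtime.GOMAXPROCS(0)  // an argument below 1 only queries the value
	return prev, now
}

func main() {
	prev, now := limitToOneThread()
	fmt.Println("GOMAXPROCS was", prev, "and is now", now)
}
```

Note that this only caps threads running Go code simultaneously; threads blocked in system calls do not count against the limit, so it makes a concurrency bug less likely to show up rather than proving the code is race-free.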
Hah. I remember Bryan Cantrill complaining about this exact thing. Glad that it's fixed.
Turns out somebody else did, too: https://twitter.com/bcantrill/status/774290166164754433?lang...
/edit: spelling
The investigation in the linked Go issue [1] is also impressive.
[1] https://github.com/golang/go/issues/20427
Don't forget about this one, similar but affects BSD and still open:
https://github.com/golang/go/issues/15658
This is a great story, probably wins the year for the "best bug you've ever encountered?" question. Having implemented some weird runtimes for weird languages, I am sympathetic to the Go team here -- these odd tradeoffs of pushing the envelope on OS <-> your_own_compiler interactions can trigger some wild experiences.
>Since the problem gets worse with temperature, what happens if I heat up the RAM?
Neat. I wonder if that makes Rowhammer more likely to occur.
Probably. Hotter usually means closer to not working for semiconductors.
Possibly, although hotter fundamentally means more thermal noise, which might actually reduce correlations / ability to communicate effectively between adjacent circuits.
Think of it as SNR (signal-to-noise ratio) -- increasing temperature increases thermal noise (there are other kinds), and with the same signal, it should actually reduce the efficiency of the side channel.
But it brings up a good question, I wonder if anyone has studied this...
For a similar tale of vDSO getting someone in trouble, check out this fun talk "Really crazy container troubleshooting stories": https://media.ccc.de/v/ASG2017-115-really_crazy_container_tr...
Ninja level debugging and diagnostic skills. A fascinating read from start to finish. Bonus points for the GRUB 2 feature for masking out bad RAM blocks – still dreaming of owning a laptop with ECC memory :/
Setting up for a 104-byte stack seems pretty crazy. Wouldn't you risk overrunning the red zone even without all that stack probing? https://en.wikipedia.org/wiki/Red_zone_(computing)
The redzone is only non-explicit stack and only really matters if you violate it. If vDSO allocates stack properly, which it should considering it's an exported function, there is no problem.
How does being an exported function change the rules? I thought the x86-64 ABI mandated an implicit safe 128 bytes below rsp at all times. Also, how can a vDSO function "allocate" stack? It would have to know about the current stack space as configured by the Go runtime, and somehow dig into this Go-runtime-specific record of the current stack limit. Isn't the only available option for any exported function just to /use/ pre-allocated stack space (by subtracting from rsp)? I don't see how it could possibly extend the pre-allocated stack.
Such a thorough and well written write up.
To think that some experienced programmers I know declare that concurrency is easy.
It's only 'easy' because
A) other (probably better?) engineers have created abstractions for them, and
B) they've never had to debug a truly difficult issue related to concurrency
I learned so much from that post. The author clearly loves tinkering with computers. I wish I had that same level of curiosity. Well done.
this is incredibly impressive
That was Captain Ahab level persistence. I wonder how long it took him.
I was one of the spectators on the Prometheus thread. It took him 2 days. It was insane.
Why doesn't the vDSO code just use MOV in its stack probe rather than an OR?
My best guess would be to prevent the 'useless instruction' from being optimized out, but an or with 0 is still useless, and I don't see what optimizer lies between this GCC feature and the final binary.
Maybe the segfault only occurs on a write?
Apparently, because it's shorter [0].
[0]: See https://lkml.org/lkml/2017/11/10/348
Why is it shorter? Both MOV and OR have one byte encodings, and with the OR you either have to use an immediate zero (which burns a byte) or materialize zero in some other way. As that email points out, the entire sequence would be shorter using a different addressing mode anyways. And a read-modify-write is definitely slower at runtime.
Because gcc's -fstack-check is garbage. Gentoo Hardened should not be using it.
From https://lkml.org/lkml/2017/11/10/310, discussing the disassembly:
> This code is so wrong I don't even no where to start. ... I suppose we could try to make the kernel fail to build at all on a broken configuration like this.
This guy is a class act. Smart AF
Thanks for taking the time to do this writeup! Super fun to read and informative.
Not only does he do that, he also explains the debug procedure like we're 5 years old. I'm so impressed.
I wish I could do something like this one day :) Impressive skills by the author! Well done!
An awesome mystery story!