Debugging an evil Go runtime bug

8 years ago (marcan.st)

50 comments

cmsimike

Marcan's attitude is great; I know of a ton of people (myself included) who would've written that article with far more complaining interleaved. Super informative as well, I learned a ton from this article (GRUB 2 feature for marking off bad RAM? Wow!). Very well written, informative, humorous, etc. Love it

jchw 8 years ago
Agreed. Marcan's a chill dude. I saw him a few times in the Dolphin Emulator IRC and he always had interesting things to say.
From my perspective this approach was pretty unique; going all the way down to debugging the hardware first may seem obvious to some, but it's a totally opposite approach to how I'd go about it. My mind would jump directly to producing a minimal test case. Would've never thought to mark off bad RAM with an obscure(ish?) GRUB 2 feature. Would've never thought to selectively flip a kernel flag for some parts of the code.
It's great to get these perspectives from people who really know how to dig down and debug deep.
- Twirrim 8 years ago
  
  From a slightly more old-school sysadmin approach, I learned to troubleshoot roughly in line with the OSI model (https://www.lifewire.com/layers-of-the-osi-model-illustrated...), starting at layer 1 (physical) and working up.
  That's not to say I spend a whole lot of time looking at the lower levels, but my quick mental checklist starts off down at physical points, and I try to quickly eliminate possibilities. In a lot of cases it's obvious it's a code / logic bug, and you can completely skip the lower layer stuff, but making it a conscious step pays off.

frikkasoft 8 years ago

That is such a well-written post, and it showcases some serious diagnostic skills by an experienced person.

Btw, thanks for telling me about the badram GRUB 2 feature, had no idea that existed.

ereyes01 8 years ago

Man I loved the approach to narrow down the offending object file using the regex on the SHA hash of the binaries. That would've saved me lots of time hunting and guessing bugs with cscope+kdb back in my kernel hacker days!
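
A minimal sketch of that kind of hash-based narrowing, written in Python as an illustration and not taken from the article's actual scripts. The object file names and the `crashes` callback are hypothetical; in practice `crashes` would rebuild with the suspect compiler flag enabled only for the selected files and rerun the reproducer. It also assumes a single culprit file.

```python
import hashlib
import re


def sha1_hex(name: str) -> str:
    """Hex SHA-1 of an object file name (hashing the name is an assumption)."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


def select(names, pattern):
    """Return the names whose SHA-1 hex digest matches the regex."""
    rx = re.compile(pattern)
    return [n for n in names if rx.match(sha1_hex(n))]


def bisect_by_hash(names, crashes):
    """Narrow down a single culprit by extending a hash prefix one hex digit at a time.

    `crashes(chosen)` is a hypothetical callback: it should return True when
    applying the suspect flag to only `chosen` still reproduces the bug.
    """
    prefix = ""
    candidates = list(names)
    while len(candidates) > 1:
        for digit in "0123456789abcdef":
            chosen = select(candidates, "^" + re.escape(prefix + digit))
            if chosen and crashes(chosen):
                prefix += digit
                candidates = chosen
                break
        else:
            # No single hash bucket reproduces the bug; stop narrowing.
            return candidates
    return candidates


if __name__ == "__main__":
    # Hypothetical file list with one known-bad entry, for demonstration only.
    files = [f"obj{i}.o" for i in range(50)] + ["culprit.o"]
    print(bisect_by_hash(files, lambda chosen: "culprit.o" in chosen))
```

Each round splits the remaining files into at most 16 buckets by the next hex digit, so a few thousand object files need only a handful of rounds of rebuilds.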

donquichotte 8 years ago

Wow, these are some serious debugging skills. I also admire the tenacity and the will to investigate the root cause.

> I tried setting GOMAXPROCS=1, which tells Go to only use a single OS-level thread to run Go code. This also stopped the crashes, again pointing strongly to a concurrency issue.

I think I would have stopped there.

cfstras 8 years ago

Hah. I remember Bryan Cantrill complaining about this exact thing. Glad that it's fixed.

Turns out somebody else did, too: https://twitter.com/bcantrill/status/774290166164754433?lang...

/edit: spelling

dmitshur 8 years ago

The investigation in the linked Go issue [1] is also impressive.

[1] https://github.com/golang/go/issues/20427

smegel 8 years ago

Don't forget about this one, similar but affects BSD and still open:
https://github.com/golang/go/issues/15658

stmw 8 years ago

This is a great story, and it probably wins the year for the "best bug you've ever encountered?" question. Having implemented some weird runtimes for weird languages, I am sympathetic to the Go team here -- these odd tradeoffs of pushing the envelope on OS <-> your_own_compiler interactions can trigger some wild experiences.

FLUX-YOU 8 years ago

>Since the problem gets worse with temperature, what happens if I heat up the RAM?

Neat. I wonder if that makes Rowhammer more likely to occur.

dboreham 8 years ago
Probably. Hotter usually means closer to not working for semiconductors.
- stmw 8 years ago
  
  Possibly, although hotter fundamentally means more thermal noise, which might actually reduce correlations / ability to communicate effectively between adjacent circuits.
  Think of it as SNR (signal-to-noise ratio) -- increasing temperature increases thermal noise (there are other kinds), and with the same signal, it should actually reduce the efficiency of the side channel.
  But it brings up a good question, I wonder if anyone has studied this...

squeed 8 years ago

For a similar tale of vDSO getting someone in trouble, check out this fun talk "Really crazy container troubleshooting stories": https://media.ccc.de/v/ASG2017-115-really_crazy_container_tr...

igravious 8 years ago

Ninja level debugging and diagnostic skills. A fascinating read from start to finish. Bonus points for the GRUB 2 feature for masking out bad RAM blocks – still dreaming of owning a laptop with ECC memory :/

0x0 8 years ago

Setting up for a 104 byte stack seems pretty crazy, wouldn't you risk overrunning the red zone even without all that stack probing? https://en.wikipedia.org/wiki/Red_zone_(computing)

zaarn 8 years ago
The redzone is only non-explicit stack and only really matters if you violate it. If vDSO allocates stack properly, which it should considering it's an exported function, there is no problem.
- 0x0 8 years ago
  
  How does being an exported function change the rules, I thought the x86-64 ABI mandated an implicit safe 128 bytes below rsp at all times? Also, how can a vDSO function "allocate" stack? It would have to know about the current stack space as configured by the go runtime, and somehow dig into this go-runtime-specific record of the current stack limit? Isn't the only available option for any exported function just to /use/ pre-allocated stack space (by subtracting from rsp) - I don't see how it could possibly extend the pre-allocated stack.
  

emmelaich 8 years ago

Such a thorough and well written write up.

To think that some experienced programmers I know declare that concurrency is easy.

EpicEng 8 years ago

It's only 'easy' because
A) other (probably better?) engineers have created abstractions for them, and
B) they've never had to debug a truly difficult issue related to concurrency

alistproducer2 8 years ago

I learned so much from that post. The author clearly loves tinkering with computers. I wish I had that same level of curiosity. Well done.

fierro 8 years ago

this is incredibly impressive

MrBuddyCasino 8 years ago
That was Captain Ahab level persistence. I wonder how long it took him.
- squeed 8 years ago
  
  I was one of the spectators on the Prometheus thread. It took him 2 days. It was insane.
  

cwzwarich 8 years ago

Why doesn't the vDSO code just use MOV in its stack probe rather than an OR?

rocqua 8 years ago

My best guess would be to prevent the 'useless instruction' from being optimized out, but an or with 0 is still useless, and I don't see what optimizer lies between this GCC feature and the final binary.
Maybe the segfault only occurs on a write?
jicks 8 years ago
Apparently, because it's shorter [0].
[0]: See https://lkml.org/lkml/2017/11/10/348
- cwzwarich 8 years ago
  
  Why is it shorter? Both MOV and OR have one byte encodings, and with the OR you either have to use an immediate zero (which burns a byte) or materialize zero in some other way. As that email points out, the entire sequence would be shorter using a different addressing mode anyways. And a read-modify-write is definitely slower at runtime.
  
amluto 8 years ago
Because gcc's -fstack-check is garbage. Gentoo Hardened should not be using it.
- exikyut 8 years ago
  
  From https://lkml.org/lkml/2017/11/10/310, discussing the disassembly:
  > This code is so wrong I don't even no where to start. ... I suppose we could try to make the kernel fail to build at all on a broken configuration like this.
  
simooooo 8 years ago

This guy is a class act. Smart AF

euph0ria 8 years ago

Thanks for taking the time to do this writeup! Super fun to read and informative.

ezoe 8 years ago

Not only does he do that, he also explains the debug procedure like we're 5 years old. I'm so impressed.

jbub 8 years ago

I wish I could do something like this one day :) Impressive skills by the author! Well done!

voiper1 8 years ago

An awesome mystery story!