Comment by strstr

1 year ago

This isn't wrong per se; it just lacks concrete recommendations for what should be done differently.

I would love to see Linux thoroughly and meaningfully tested. For some parts it's just... hard. (If anyone wants to get their start writing kernel code, have a crack at writing some self-tests for a component that looks complicated. The relevant maintainer will probably be excited to see literally anyone writing tests.)

For this particular bug, the cheapest place to catch the issue would have been code review. In a normal code base, the next cheapest would have been unit testing, though in this case that may not have caught it, because the underlying bug required a caller to break a function's contract: one part of Linux broke the contract of another. (Why was there no BUG_ON for that...?)
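To make the "why no BUG_ON" point concrete, here is a rough userspace sketch of a function that checks its own contract instead of trusting callers. It is not the actual kernel code involved (I haven't linked the bug): `struct region`, `region_write`, and `CONTRACT_VIOLATED` are all made-up names, and the macro is only a stand-in loosely modeled on how `WARN_ON()` reports a condition and evaluates to true when it fires.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for a WARN_ON()-style check: logs the
 * violated condition and returns 1 if it was true, 0 otherwise. */
int contract_violated(int cond, const char *expr, const char *func)
{
    if (cond)
        fprintf(stderr, "contract violated in %s: %s\n", func, expr);
    return cond != 0;
}
#define CONTRACT_VIOLATED(cond) contract_violated(!!(cond), #cond, __func__)

/* Hypothetical fixed-size buffer, used only for illustration. */
struct region {
    unsigned char data[64];
    size_t len; /* number of valid bytes in data */
};

/* Contract: r and src are non-NULL, and [off, off + n) lies inside r->data.
 * Returns 0 on success, -1 if the caller broke the contract. */
int region_write(struct region *r, size_t off, const void *src, size_t n)
{
    if (CONTRACT_VIOLATED(r == NULL || src == NULL))
        return -1;
    /* Written to avoid overflow in off + n. */
    if (CONTRACT_VIOLATED(off > sizeof(r->data) || n > sizeof(r->data) - off))
        return -1;

    memcpy(r->data + off, src, n);
    if (off + n > r->len)
        r->len = off + n;
    return 0;
}
```

With this in place, a caller that passes an out-of-range write (say, 8 bytes at offset 60) gets a loud message and an error return rather than silently corrupting whatever sits next to the buffer. Whether to warn and bail out, as here, or to halt outright the way BUG_ON does is a separate judgment call.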

Eliminating the class of issue required fairly invasive forms of introspection on VMs running a custom module. Sure, we did that... eventually.

Finding it originally required stumbling on a Linux distro that happened to manifest the corruption visibly (roughly once per 50 thirty-minute integration test runs, which is pretty frequent in the scheme of corruption bugs).

2 comments

michaeljx 1 year ago

Could it be a memory-related bug, which would not have existed in a memory-safe language like Rust?

strstr 1 year ago

You are probably saying this as a troll, but I’ll bite. I mean, sure, Rust would have helped.
Technically, the borrow checker and bounds checks wouldn’t have done it here (I’m aware I’m being obtuse by not just linking the bug).
Having cleaner types and abstractions would almost certainly have solved the problem though. Normal C++ would have worked as well as Rust.