Comment by skrtskrt

1 year ago

a bug taking a year to track down is a negative indicator of the quality of project maintenance, not the person who contributed the bug, whether it's due to the code itself or the tooling and testing environments available to verify such important issues.

3 comments

strstr 1 year ago

This isn't wrong per se, but it lacks concrete recommendations for what should be done differently.

I would love to see Linux thoroughly and meaningfully tested. For some parts it's just... hard. (If anyone wants to get their start writing kernel code, have a crack at writing some self-tests for a component that looks complicated. The relevant maintainer will probably be excited to see literally anyone writing tests.)

For this particular bug, the cheapest spot to catch the issue would have been code review. In a normal code base, the next cheapest would have been unit testing, though in this situation that may not have caught it, given that the underlying bug required someone to break the contract of a function. (One part of Linux broke the contract of another. Why did it not BUG_ON for that...)
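To make the contract point concrete, here is a minimal user-space sketch, not the kernel code in question. `struct buf`, `buf_contract_ok`, `buf_remaining`, and `CONTRACT_BUG_ON` are hypothetical names invented for illustration; the macro only imitates the spirit of the kernel's `BUG_ON()`. The idea is that a unit test of the callee only feeds it valid input, so a check at the function boundary is what catches a caller elsewhere that breaks the contract.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's BUG_ON(): stop loudly if cond holds. */
#define CONTRACT_BUG_ON(cond)                                    \
    do {                                                         \
        if (cond) {                                              \
            fprintf(stderr, "contract broken: %s (%s:%d)\n",     \
                    #cond, __FILE__, __LINE__);                  \
            abort();                                             \
        }                                                        \
    } while (0)

/* Hypothetical type. Documented contract: len never exceeds cap. */
struct buf {
    size_t len;
    size_t cap;
};

/* Returns nonzero when the caller has kept the documented contract. */
int buf_contract_ok(const struct buf *b)
{
    return b != NULL && b->len <= b->cap;
}

/* Returns the free space left in b. Aborts if the contract is broken,
 * instead of silently computing a wrapped-around size. */
size_t buf_remaining(const struct buf *b)
{
    CONTRACT_BUG_ON(!buf_contract_ok(b));
    return b->cap - b->len;
}
```

With `len = 3` and `cap = 8`, `buf_remaining` returns 5. A caller that passes `len = 9` with `cap = 8` aborts at the boundary rather than handing back a huge unsigned value that corrupts something much later.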

Eliminating the class of issue required fairly invasive forms of introspection on VMs running a custom module. Sure, we did that... eventually.

Finding it originally required stumbling on a distro of Linux that accidentally manifested the corruption visibly (about once per 50 or so 30-minute integration test runs, which is pretty frequent in the scheme of corruption bugs).

michaeljx 1 year ago
Could it be a memory-related bug, which would not have existed in a memory-safe language like Rust?
- strstr 1 year ago
  
  You are probably saying this as a troll, but I’ll bite. I mean, sure, Rust would have helped.
  Technically, the borrow checker and bounds checks wouldn’t have done it here (I’m aware I’m being obtuse by not just linking the bug).
  Having cleaner types and abstractions would almost certainly have solved the problem though. Normal C++ would have worked as well as Rust.