Comment by strstr

10 months ago

Being an upstream maintainer is incredibly under-appreciated. It’s an unfathomably hard, and somewhat thankless, job (at least if you do it well). A friend of mine was in a cab with Ted Ts’o at a conference, and Ted was reviewing patches on his phone to keep up with the workload (or maybe he was bored, who knows).

Despite incredible effort from maintainers, getting necessary changes into Linux can take forever. In the subsystem I depend on (and occasionally contribute to directly) it’s kinda assumed it will take at least a year (probably two) for any substantial project to get merged. This continuously disappoints PMs and Leadership. A lot of people, understandably, chafe against this lack of agility.

OTOH, I’ve been on the other side of kernel bugs. Most recently, a memory arithmetic bug was causing corruption, and it took my team at least an engineer-year to track down. This makes me quite sympathetic to maintainers’ demands for quality.

I’ve also been on the other side of the calibration discussions where Open Source work goes underappreciated. The irony never stops (“They won’t merge our patches!” “Are you having your engineers review theirs?”). On top of that, there are the raw pipeline issues for maintainers: it takes a lot of experience to be a maintainer, which implies spending a lot of a bright engineer’s time on reviewing and contributing upstream to things unrelated to immediate priorities.

A bug taking a year to track down is a negative indicator of the quality of project maintenance, not of the person who contributed the bug, whether it's due to the code itself or to the tooling and testing environments available to verify such important issues.

  • This isn't wrong per se, but it lacks concrete recommendations for what should be done differently.

    I would love to see Linux thoroughly and meaningfully tested. For some parts it's just... hard. (If anyone wants to get their start writing kernel code, have a crack at writing some self-tests for a component that looks complicated. The relevant maintainer will probably be excited to see literally anyone writing tests.)

    For this particular bug, the cheapest spot to catch the issue would have been code review. In a normal code base, the next cheapest would have been unit testing. In this situation, though, unit tests may not have caught it, because the underlying bug required someone to break the contract of a function (one part of Linux broke the contract of another; why did it not BUG_ON for that...).

    Eliminating the class of issue required fairly invasive forms of introspection on VMs running a custom module. Sure, we did that... eventually.

    Finding it originally required stumbling on a distro of Linux that accidentally manifested the corruption visibly (about once per 50ish 30-minute integration test runs, which is pretty frequent in the scheme of corruption bugs).
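
    To make the BUG_ON question above concrete, here is a minimal userspace sketch of a callee enforcing its own contract. It is not the code from the bug described here, and it is not kernel code: `CONTRACT_BUG_ON`, `struct region`, `range_in_region`, and `region_end_of_range` are all hypothetical names invented for illustration, and `abort()` only stands in for what the kernel's `BUG_ON()` does.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical userspace stand-in for the kernel's BUG_ON():
     * stop loudly at the point of violation instead of corrupting memory later. */
    #define CONTRACT_BUG_ON(cond)                                      \
        do {                                                           \
            if (cond) {                                                \
                fprintf(stderr, "contract violated: %s (%s:%d)\n",     \
                        #cond, __FILE__, __LINE__);                    \
                abort();                                               \
            }                                                          \
        } while (0)

    /* Hypothetical region descriptor; not modeled on any real kernel struct. */
    struct region {
        size_t start; /* first address covered by the region */
        size_t len;   /* number of bytes in the region */
    };

    /* Non-fatal form of the contract, so it can be unit tested on its own.
     * Written as two comparisons so that off + n is never computed and
     * therefore cannot wrap around. */
    static int range_in_region(const struct region *r, size_t off, size_t n)
    {
        return off <= r->len && n <= r->len - off;
    }

    /* Contract: [off, off + n) must lie inside the region.
     * Returns the absolute address one past the end of the range. */
    static size_t region_end_of_range(const struct region *r, size_t off, size_t n)
    {
        CONTRACT_BUG_ON(!range_in_region(r, off, n));
        return r->start + off + n;
    }
    ```

    With a region of `{ .start = 4096, .len = 100 }`, a caller passing `off = 10, n = 20` gets `4126` back, while `off = 1, n = 100` aborts at the call site rather than producing an out-of-range address. The trade-off is the usual one: a fatal check turns silent corruption into a crash, which is only a win if crashing is the less bad outcome for that code path.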