← Back to context

Comment by mattmanser

1 month ago

Are ghost bugs even real?

My first job had the Devs working front-line support years ago. Due to that, I learnt an important lessons in bug fixing.

Always be able to re-create the bug first.

There are no such thing as ghost bugs, you just need to ask the reporter the right questions.

Unless your code is multi-threaded, to which I say, good luck!

They're real at scale. Plenty of bugs don't suface until you're running under heavy load on distributed infrastructure. Often the culprit is low in the stack. Asking the reporter the right questions may not help in this case. You have full traces, but can't reproduce in a test environment.

When the cause is difficult to source or fix, it's sometimes easier to address the effect by coding around the problem, which is why mature code tends to have some unintuitive warts to handle edge cases.

> Unless your code is multi-threaded, to which I say, good luck!

What isn't multi-threaded these days? Kinda hard to serve HTTP without concurrency, and practically every new business needs to be on the web (or to serve multiple mobile clients; same deal).

All you need is a database and web form submission and now you have a full distributed system in your hands.

  • Only superficially so, await/async isn't usually like the old spaghetti multi-threaded code people used to write.

    • You mean in a single-threaded context like Javascript? (Or with Python GIL giving the impression of the same.) That removes some memory corruption races, but leaves all the logical problems in place. The biggest change is that you only have fixed points where interleaving can happen, limiting the possibilities -- but in either scenario, the number of possible paths is so big it's typically not human-accessible.

      Webdevs not aware of race conditions -> complex page fails to load. They're lucky in how the domain sandboxes their bugs into affecting just that one page.

  • nginx is single–threaded, but you're absolutely right — any concurrency leads to the same ghost bugs.

    • nginx is also from the era when fast static file serving was still a huge challenge, and "enough to run a business" for many purposes -- most software written has more mutable state, and much more potential for edge cases.

Historically I would have agreed with you. But since the rise of LLM-assisted coding, I've encountered an increasing number of things I'd call clear "ghost bugs" in single threaded code. I found a fun one today where invoking a process four times with a very specific access pattern would cause a key result of the second invocation to be overwritten. (It is not a coincidence, I don't think, that these are exactly the kind of bugs a genAI-as-a-service provider might never notice in production.)