Debugging: Indispensable rules for finding even the most elusive problems (2004)

6 days ago (dwheeler.com)

In my experience, the most pernicious temptation is to take the buggy, non-working code you have now and to try to modify it with "fixes" until the code works. In my experience, you often cannot get broken code to become working code because there are too many possible changes to make. In my view, it is much easier to break working code than it is to fix broken code.

Suppose you have a complete chain of N Christmas lights and they do not work when turned on. The temptation is to go through all the lights and to substitute in a single working light until you identify the non-working light.

But suppose there are multiple non-working lights? You'll never find the error with this approach. Instead, you need to start with the minimal working approach -- possibly just a single light (if your Christmas lights work that way), adding more lights until you hit an error. In fact, the best case is if you have a broken string of lights and a similar but working string of lights! Then you can easily swap a test bulb out of the broken string and into the working chain until you find all the bad bulbs in the broken string.

Starting with a minimal working example is the best way I have found to fix a bug. And you will find you resist this, because you believe you are close and starting from scratch feels too time-consuming. In practice, it tends to be a real time-saver, not the opposite.

  • The quickest solution, assuming learning from the problem isn't the priority, might be to replace the entire chain of lights without testing any of them. I've been part of some elusive production issues where eventually 1-2 team members attempted a rewrite of the offending routine while everyone else debugged it, and the rewrite "won" and shipped to production before we found the bug. Heresy I know. In at least one case we never found the bug, because we could only dedicate a finite amount of time to a "fixed" issue.

    • > The quickest solution, assuming learning from the problem isn't the priority, might be to replace the entire chain of lights without testing any of them.

      So as a metaphor for software debugging, this is "throw away the code, buy a working solution from somewhere else." It may be a way to run a business, but it does not explain how to debug software.

      3 replies →

    • Depending on what your routine looks like, you could run both on a given input, and see when / where they differ.
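
      A minimal sketch of that idea in Python (old_routine and new_routine are hypothetical stand-ins for the known-good and the suspect implementation): run both over the same randomized inputs and report the first input where they disagree.

        import random

        def old_routine(xs):          # stand-in for the known-good implementation
            return sorted(xs)

        def new_routine(xs):          # stand-in for the suspect implementation
            return sorted(set(xs))    # deliberately buggy: drops duplicates

        def first_divergence(trials=1000):
            """Return (input, old output, new output) for the first mismatch found."""
            for _ in range(trials):
                data = [random.randint(0, 9) for _ in range(random.randint(0, 8))]
                if old_routine(data) != new_routine(data):
                    return data, old_routine(data), new_routine(data)
            return None

        print(first_divergence())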

Rule 0: Don't panic

Really, that's important. You need to think clearly; deadlines and angry customers are a distraction. That's also when having a good manager who trusts you is important: his job is to shield you from all that so that you can devote all of your attention to solving the problem.

  • 100% agree. I remember being on-call when our PagerDuty started going off for a SEV-2, and naturally a lot of managers from the affected teams were in there sweating bullets because their products/features/metrics were impacted. It can get pretty frustrating having that many cooks in the kitchen. We had a great manager who literally just moved us to a different call/meeting and told us, "Ignore everything those people are saying; just stay focused and I'll handle them." Everyone's respect for our manager really went up from there.

  • There's a story in the book - on nuclear submarines there's a brass bar in front of all the dials and knobs, and the engineers are trained to "grab the bar" when something goes wrong rather than jumping right to twiddling knobs to see what happens.

    • I read this book and took this advice to heart. I don't have a brass bar in the office, but when I'm about to push a button that could cause destructive changes, especially in prod, my hands reflexively fly up into the air while I double-check everything.

      3 replies →

    • Thank you for explaining that phrase! I couldn't find it with a quick Google.

  • I had a boss who used to say that her job was to be a crap umbrella, so that the engineers under her could focus on their actual jobs.

    • I once worked with a company that provided IM services to hyper competitive, testosterone poisoned options traders. On the first fine trading day of a January new year, our IM provider rolled out an incompatible "upgrade" to some DLL that we (our software, hence our customers) relied on, that broke our service. Our customers, ahem, let their displeasure be known.

      Another developer and I were tasked with fixing it. The Customer Service manager (although one of the most conniving, politically destructive assholes I have ever not-quite worked with) actually carried a crap umbrella. Instead of constantly flaming us with how many millions of dollars our outage was costing every minute, he held up that umbrella and diverted the crap. His forbearance let us focus. He discreetly approached every 20 minutes, toes not quite entering the office, calmly inquiring how it was going. In just over an hour (between his visits 3 and 4), Nate and I had the diagnosis, the fix, and had rolled it out to production, to the relief of pension funds worldwide.

      As much as I dislike the memory of that manager to this day, I praise his wisdom every chance I get.

    • I always say this too. But the real trick is knowing what to let thru. You can’t just shield your team from everything going on in the organisation. You’re all a part of the organisation and should know enough to have an opinion.

      A better analogy is you’re there to turn down the noise. The team hears what they need to hear and no more.

      Equally, the job of a good manager is to help escalate team concerns. But just as there’s a filter stopping the shit flowing down, you have to know what to flow up too.

  • A corollary to this is always have a good roll-back plan. It's much nicer to be able to roll-back to a working version and then be able to debug without the crisis-level pressure.

    • Rollback ability is a must—it can be the most used mitigation if done right.

      Not all issues can be fixed with a rollback though.

  • I once worked on a team where, when a serious visible incident occurred, a company VP would pace the floor, occasionally yelling, describing how much money we were losing per second (or how much customer trust, if that number was too low) or otherwise communicating that we were in a battlefield situation and things were Very Critical.

    Later I worked for a company with a much bigger and more critical website, and the difference in tone during urgent incidents was amazing. The management made itself available for escalations and took a role in externally communicating what was going on, but besides that they just trusted us to do our jobs. We could even go get a glass of water during the incident without a VP yelling at us. I hadn't realized until that point that being calm adults was an option.

  • Also, a pager/phone going off incessantly isn't useful either. Manage your alarms or you'll be throwing your phone at a wall.

  • This is very underrated. An extension to this: don't be afraid to break things further to probe. I often see a lot of devs, mid-level included, panicking, which prevents them from even knowing where to start. I've come to believe that some people just have an inherent intuition and some just need to learn it.

    • Yes, sometimes instinct takes over when you're on the spot in a pinch, but there are institutional things you can do to be prepared in advance that expand your set of options in the moment, much like a pre-prepared fire-drill playbook you can pull from. There are also training courses like Kepner-Tregoe. But you're right, there are just some people who do better than others when it's hitting the fan.

  • Uff, yeah. I used to work with a guy who would immediately turn the panic up to 11 at the first thought of a bug in prod. We would end up with worse architecture after his "fix" or he would end up breaking something else.

  • Agreed. It's practically a prerequisite for everything else in the book. Staying calm and thinking clearly is foundational.

For #4 (divide and conquer), I've found `git bisect` helps a lot. If you have a known good commit and one of dozens or hundreds of commits after that is bad, this can help you identify the bad commit / code in a few steps.

Here's a walk through on using it: https://nickjanetakis.com/blog/using-git-bisect-to-help-find...

I jumped into a pretty big unknown code base in a live consulting call and we found the problem pretty quickly using this method. Without that, the scope of where things could be broken was too big given the context (unfamiliar code base, multiple people working on it, only able to chat with 1 developer on the project, etc.).
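
The search can even be automated with `git bisect run`. A rough sketch of the kind of check script you could hand it, in Python (the `make` and `./app --selftest` commands are placeholders for whatever builds the project and reproduces the bug):

    #!/usr/bin/env python3
    # Placeholder check script for `git bisect run` (build/test commands are
    # illustrative). After `git bisect start`, `git bisect bad HEAD` and
    # `git bisect good <known-good-commit>`, run:
    #     git bisect run python3 check_bug.py
    # Exit 0 marks the current commit good; a non-zero exit marks it bad.
    import subprocess
    import sys

    def bug_is_present() -> bool:
        subprocess.run(["make", "-s"], check=True)        # build the project
        result = subprocess.run(["./app", "--selftest"])  # try to reproduce the bug
        return result.returncode != 0

    sys.exit(1 if bug_is_present() else 0)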

  • "git bisect" is why I maintain the discipline that all commits to the "real" branch, however you define that term, should all individually build and pass all (known-at-the-time) tests and generally be deployable in the sense that they would "work" to the best of your knowledge, even if you do not actually want to deploy that literal release. I use this as my #1 principle, above "I should be able to see every keystroke ever written" or "I want every last 'Fixes.' commit" that is sometimes advocated for here, because those principles make bisect useless.

    The thing is, I don't even bisect that often... the discipline necessary to maintain that in your source code heavily overlaps with the disciplines to prevent code regression and bugs in the first place, but when I do finally use it, it can pay for itself in literally one shot once a year, because we get bisect out for the biggest, most mysterious bugs, the ones that I know from experience can involve devs staring at code for potentially weeks, and while I've yet to have a bisect that points at a one-line commit, I've definitely had it hand us multiple days' worth of clue in one shot.

    If I was maintaining that discipline just for bisect we might quibble with the cost/benefits, but since there's a lot of other reasons to maintain that discipline anyhow, it's a big win for those sorts of disciplines.

    • Sometimes you'll find a repo where that isn't true. Fortunately, git bisect has a way to deal with failed builds, etc: three-value logic. The test program that git bisect runs can return an exit value that means that the failure didn't happen, a different value that means that it did, or a third that means that it neither failed nor succeeded. I wrote up an example here:

      https://speechcode.com/blog/git-bisect
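
      A rough sketch of how that three-value convention maps onto a `git bisect run` script in Python (the build/test commands are placeholders): exit 0 means the failure didn't happen, exit 1 means it did, and exit 125 tells bisect this commit can't be judged either way and should be skipped.

        #!/usr/bin/env python3
        import subprocess
        import sys

        # Placeholder commands; substitute whatever builds and reproduces the bug.
        if subprocess.run(["make", "-s"]).returncode != 0:
            sys.exit(125)      # untestable commit (e.g. it doesn't build): skip it

        test = subprocess.run(["./app", "--selftest"])
        sys.exit(0 if test.returncode == 0 else 1)   # 0 = good, 1 = bad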

      1 reply →

    • I do bisecting almost as a last resort. I've used it when all else fails only a few times. Especially as I've never worked on code where it was very easy to just build and deploy a working debug system from a random past commit.

      Edit to add: I will study old diffs when there is a bug, particularly for bugs that seem correlated with a new release. Asking "what has changed since this used to work?" often leads to an obvious cause or at least helps narrow where to look. Also asking the person who made those changes for help looking at the bug can be useful, as the code may be more fresh in their mind than in yours.

    • > why I maintain the discipline that all commits to the "real" branch, however you define that term, should all individually build and pass all (known-at-the-time) tests and generally be deployable in the sense that they would "work" to the best of your knowledge, even if you do not actually want to deploy that literal release

      You’re spot on.

      However it's clearly a missing feature that Git/Mercurial can't tag diffs as "passes" or "bisectable".

      This is especially annoying when you want to merge a stack of commits and the top passes all tests but the middle does not. It’s a monumental and valueless waste of time to fix the middle of the stack. But it’s required if you want to maintain bisectability.

      It’s very annoying and wasteful. :(

      14 replies →

    • Same. Every branch apart from the “real” one and release snapshots is transient and WIP. They don’t get merged back unless tests pass.

  • Back in the 1990s, while debugging some network configuration issue a wiser older colleague taught me the more general concept that lies behind git bisect, which is "compare the broken system to a working system and systematically eliminate differences to find the fault." This can apply to things other than software or computer hardware. Back in the 90s my friend and I had identical jet-skis on a trailer we shared. When working on one of them, it was nice to have its twin right there to compare it to.

  • The principle here "bisection" is a lot more general than just "git bisect" for identifying ranges of commits. It can also be used for partitioning the space of systems. For instance, if a workflow with 10 steps is broken, can you perform some tests to confirm that 5 of the steps functioned correctly? Can you figure out that it's definitely not a hardware issue (or definitely a hardware issue) somewhere?

    This is critical to apply in cases where the problem might not even be caused by a code commit in the repo you're bisecting!

  • Not to complain about bisect, which is great. But IMHO it's really important to distinguish the philosophy and mindspace aspect to this book (the "rules") from the practical advice ("tools").

    Someone who thinks about a problem via "which tool do I want" (c.f. "git bisect helps a lot"[1]) is going to be at a huge disadvantage to someone else coming at the same decisions via "didn't this used to work?"[2]

    The world is filled to the brim with tools. Trying to file away all the tools in your head just leads to madness. Embrace philosophy first.

    [1] Also things like "use a time travel debugger", "enable logging", etc...

    [2] e.g. "This state is illegal, where did it go wrong?", "What command are we trying to process here?"

    • I've spent the past two decades working on a time travel debugger so obviously I'm massively biased, but IMO most programmers are not nearly as proficient in the available debug tooling as they should be. Consider how long it takes to pick up a tool so that you at least have a vague understanding of what it can do, and compare to how much time a programmer spends debugging. Too many just spend hour after hour hammering out printf's.

      2 replies →

  • You can also use divide and conquer when dealing with a complex system.

    Like, traffic going from A to B can turn ... complicated with VPNs and such. You kinda have source firewalls, source routing, connectivity of the source to a router, routing on the router, firewalls on the router, various VPN configs that can go wrong, and all of that on the destination side as well. There can easily be 15+ things that can cause the traffic to disappear.

    That's why our runbook recommends starting troubleshooting by dumping traffic on the VPN nodes. That's a very low-effort, quick step to figure out which of the six-ish legs of the journey drops traffic - to VPN, through VPN, to destination, back to VPN node, back through VPN, back to source. Then you realize traffic back to the VPN node disappears and you can dig into that.

    And this is a powerful concept to think through in system troubleshooting: Can I understand my system as a number of connected tubes, so that I have a simple, low-effort way to pinpoint one tube to look further into?

    As another example, for many services, the answer here is to look at the requests on the loadbalancer. This quickly isolates which services are throwing errors blowing up requests, so you can start looking at those. Or, system metrics can help - which services / servers are burning CPU and thus do something, and which aren't? Does that pattern make sense? Sometimes this can tell you what step in a pipeline of steps on different systems fails.

  • git bisect is an absolute power feature everybody should be aware of. I use it maybe once or twice a year at most but it's the difference between fixing a bug in an hour vs spending days or weeks spinning your wheels

  • Bisection is also useful when debugging css.

    When you don't know what is breaking that specific scroll or layout somewhere in the page, you can just remove half the DOM in the dev tools and check if the problem is still there.

    Rinse and repeat, it's a basic binary search.

    I am often surprised that leetcode black belts are absolutely unable to apply what they learn in the real world, neither in code nor in debugging, which always reminds me of what a useless metric it is for hiring engineers.

  • Binary search rules. Being systematic about dividing the problem in half, determining which half the issue is in, and then repeating applies to non-software problems quite well. I use the strategy all the time while troubleshooting issues with cars, etc.

Make sure you're editing the correct file on the correct machine.

  • Yep, this is a variation of "check the plug"

    I find myself doing this all the time now: I will temporarily add a line to cause a fatal error, to check that it's the right file (and, depending on the situation, also the right line)

    • I'm glad I'm not the only one doing this after I wasted too much time trying to figure out why my docker build was not reflecting the changes ... never again..

  • How much time I've wasted unknowingly editing generated files, out of version files, forgetting to save, ... only god knows.

  • the biggest thing I've always told myself and anyone I've taught: make sure you're running the code you think you're running.

    • Baby steps, if the foundation is shaky no amount of reasoning on top is going to help.

  • That's why you make it break differently first. To see your changes have any effect.

    • When working on a test that has several asserts, I have adopted the process of adding one final assert, "assert 'TEST DEBUGGED' is False", so that even when I succeed, the test fails -- and I could review to consider if any other tests should be added or adjusted.

      Once I'm satisfied with the test, I remove the line.
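
      A minimal pytest-style sketch of that sentinel trick (the test body is made up): the final assert fails on purpose, so the run stays red until the test has been reviewed and the sentinel line deleted.

        def test_invoice_total_rounds_to_cents():
            total = round(0.1 + 0.2, 2)   # stand-in for the code under test
            assert total == 0.3
            # Sentinel: keeps the test failing even when the asserts above pass.
            # Delete this line once the remaining asserts have been reviewed.
            assert False, "TEST DEBUGGED"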

Some additional rules:

- "It is your own fault". Always suspect your code changes before anything else. It can be a compiler bug or even a hardware error, but those are very rare.

- "When you find a bug, go back and hunt down its family and friends". Think where else the same kind of thing could have happened, and check those.

- "Optimize for the user first, the maintenance programmer second, and last if at all for the computer".

  • Alternatively, I've found the "Maybe it's a bug. I'll try and make a test case I can report on the mailing list" approach useful at times.

    Usually, in the process of reducing my error-generating code down to a simpler case, I find the bug in my logic. I've been fortunate that heisenbugs have been rare.

    Once or twice, I have ended up with something to report to the devs. Generally, those were libraries (probably from sourceforge/github) with only a few hundred or less users that did not get a lot of testing.

  • I always have the mindset of "its my fault". My Linux workstation constantly crashing because of the i9-13900k in it was honestly humiliating. Was very relieved when I realized it was the CPU and not some impossible to find code error.

  • It's healthier to assume your code is wrong than otherwise. But it's best to simply bisect the cause-effect chain a few more times and be sure.

  • About "family and friends": a couple of times, fixing minor and a priori unrelated side issues revealed the bug I was after.

Step 10, add the bug as a test to the CI to prevent regressions? Make sure the CI fails before the fix and works after the fix.
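
As a hedged sketch (the function and issue number are invented), such a regression test can be tiny; the point is that it fails on the pre-fix code and passes afterwards:

    def apply_discount(price: float, percent: float) -> float:
        """Fixed version: clamp so an over-100% coupon can't produce a negative price."""
        return max(price * (1 - percent / 100), 0.0)

    def test_discount_does_not_go_negative_issue_1234():
        # Regression test for a hypothetical issue #1234: the pre-fix code
        # returned -1.0 here instead of clamping at zero.
        assert apply_discount(price=10.0, percent=110) == 0.0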

  • The largest purely JavaScript repo I ever worked on (150k LoC) had this rule and it was a life saver, particularly because the project had commits dating back more than five years, and since it was a component/library, it had quite a few strange hacks for IE.

  • I don't think this is always worth it. Some tests can be time consuming or complex to write, have to be maintained, and we accept that a test suite won't be testing all edge cases anyway. A bug that made it to production can mean that particular bug might happen again, but it could be a silly mistake and no more likely to happen again than 100s of other potential silly mistakes. It depends, and writing tests isn't free.

    • Writing tests isn't free but writing non-regression tests for bugs that were actually fixed is one of the best test cases to consider writing right away, before the bug is fixed. You'll be reproducing the bug anyway (so already consider how to reproduce). You'll also have the most information about it to make sure the test is well written anyway, after building a mental model around the bug.

      Writing tests isn't free, I agree, but in this case a good chunk of the cost of writing them will have already been paid in a way.

      5 replies →

    • You (or other people) will thank yourself in a few months/years when refactoring the code, knowing that they don't need to worry about missing edge cases, because all known edge cases are covered with these non regression tests.

      5 replies →

  • Yes, just more generally document it

    I've lost count of how many things I've fixed only to see:

    1) It recurs because a deeper "cause" of the bug reactivated it.

    2) Nobody knew I fixed something so everyone continued to operate workarounds as if the bug was still there.

    I realise these are related and arguably already fall under "You didn't fix it". That said a bit of writing-up and root-cause analysis after getting to "It's fixed!" seems helpful to others.

  • What do you do with the years old bug fixes? How fast can one run the CI after a long while of accumulating tests? Do they still make sense to be kept in the long run?

    • Why would you want to stop knowing that your old bug fixes still worked in the context of your system?

      Saying "oh its been good for awhile now" has nothing to do with breaking it in the future.

    • I'm not particularly passionate about arguing the exact details of "unit" versus "integration" testing, let alone breaking down the granularity beyond that as some do, but I am passionate that they need to be fast, and this is why. By that, I mean, it is a perfectly viable use of engineering time to make changes that deliberately make running the tests faster.

      A lot of slow tests are slow because nobody has even tried to speed them up. They just wrote something that worked, probably literally years ago, that does something horrible like fully build a docker container and fully initialize a complicated database and fully do an install of the system and starts processes for everything and liberally uses "sleep"-based concurrency control and so on and so forth, which was fine when you were doing that 5 times but becomes a problem when you're trying to run it hundreds of times, and that's a problem, because we really ought to be running it hundreds of thousands or millions of times.

      I would love to work on a project where we had so many well-optimized automated tests that despite their speed they were still a problem for building. I'm sure there's a few out there, but I doubt it's many.

    • I would say yes, your CI should accumulate all of those regression tests. Where I work we now have many, many thousands of regression test cases. There's a subset to be run prior to merge which runs in reasonable time, but the full CI just cycles through.

      For this to work all the regression tests must be fast, and 100% reliable. It's worth it though. If the mistake was made once, unless there's a regression test to catch it, it'll be made again at some point.

      1 reply →

    • This is a great problem to have, if (IME) rare. Step 1 Understand the System helps you figure out when tests can be eliminated as no longer relevant and/or which tests can be merged.

    • I think for some types of bugs a CI test would be valuable if the developer believes regressions may occur, for other bugs they would be useless.

If folks want to instill this mindset in their kids, themselves or others I would recommend at least

The Martian by Andy Weir https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)

https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_...

https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel)

To Engineer Is Human - The Role of Failure in Successful Design By Henry Petroski https://pressbooks.bccampus.ca/engineeringinsociety/front-ma...

https://en.wikipedia.org/wiki/Surely_You%27re_Joking,_Mr._Fe...!

  • Agree. I think Zen and the Art of Motorcycle Maintenance encapsulates the art of troubleshooting the best. Especially the concept of "gumption traps",

    "What you have to do, if you get caught in this gumption trap of value rigidity, is slow down...you're going to have to slow down anyway whether you want to or not...but slow down deliberately and go over ground that you've been over before to see if the things you thought were important were really important and to -- well -- just stare at the machine. There's nothing wrong with that. Just live with it for a while. Watch it the way you watch a line when fishing and before long, as sure as you live, you'll get a little nibble, a little fact asking in a timid, humble way if you're interested in it. That's the way the world keeps on happening. Be interested in it."

    Words to live by

    • I'm slogging through Zen, and it's a bit trite so far (opening pages). I'm struggling to continue. When will it stop talking about the climate and blackbirds and start saying something interesting?

      2 replies →

    • Yeah, it is probably what I would start with and all the messages the book is sending you will resurface in the others. You have to cultivate that debugging mind, but once it starts to grow, it can't be stopped.

  • What part of Three Body Problem has anything to do with debugging or troubleshooting or any form of planning?

    Things just sort of happen with wild leaps of logic. The book is actually a fantasy book with thinnest layer of science babble on top.

> #1 Understand the system: Read the manual, read everything in depth, know the fundamentals, know the road map, understand your tools, and look up the details.

Maybe I'm misunderstanding, but "Read the manual, read everything in depth" sounds like: oh, I have a bug in my code, so first read the entire manual of the library I'm using, all 700 pages, then read 7 books on the library's details, and now that a month or two has passed, go look at the bug.

I'd be curious if there's a single programmer that follows this advice.

  • I think we have a bit different interpretations here.

    > read everything in depth

    Is not necessarily

    > first read the entire manual of the library I'm using, all 700 pages

    If I have a problem with "git bisect", I can just go to stackoverflow, try several snippets and see what sticks, or I can also go to https://git-scm.com/docs/git-bisect to get a bit deeper knowledge on the topic.

  • Essentially yes, that's correct. Your mistake is thinking that the outcome of those months of work is being able to kinda-probably fix one single bug. No: the point of all that effort is to truly fix all the bugs of that kind (or as close to "all" as is feasible), and to stop writing them in the first place.

    The alternative is paradropping into an unknown system with a weird bug, messing randomly with things you don't understand until the tests turn green, and then submitting a PR and hoping you didn't just make everything even worse. It's never really knowing whether your system actually works or not. While I understand that is sometimes how it goes, doing that regularly is my nightmare.

    P.S. if the manual of a library you're using is 700 pages, you're probably using the wrong library.

    • > if the manual of a library you're using is 700 pages, you're probably using the wrong library.

      Statement bordering on papyrophobia. (Seems that is a real phobia)

      2 replies →

  • great strawman, guy that refuses to read documentation

    • I mean, he has a point. Things are incredibly complex nowadays; I don't think most people have time to "understand the system."

      I would be much more interested in rules that don't start with that... Like "Rules for debugging when you don't have the capacity to fully understand every part of the system."

      Bisecting is a great example here. If you are bisecting, by definition you don't fully understand the system (or you would know which change caused the problem!)

I have been bitten more than once thinking that my initial assumption was correct, diving deeper and deeper - only to realize I had to ascend and look outside of the rabbit hole to find the actual issue.

> Assumption is the mother of all screwups.

  • This is how I view debugging, aligning my mental model with how the system actually works. Assumptions are bugs in the mental model. The problem is conflating what is knowledge with what is an assumption.

  • I once heard from an RTS game caster (IIRC it was Day9, about Starcraft): "Assuming... is killing you".

  • When playing Captain Obvious (i.e. the human rubber duck) with other devs, every time they state something to be true my response is, "prove it!" It's amazing how quickly you find bugs when somebody else is making you question your assumptions.

Then, after successful debugging your job isn't finished. The outline of "Three Questions About Each Bug You Find" <http://www.multicians.org/thvv/threeq.html> is:

1. Is this mistake somewhere else also?

2. What next bug is hidden behind this one?

3. What should I do to prevent bugs like this?

The article is a 2024 "review" (really more of a very brief summary) of a 2002 book about debugging.

The list is fun for us to look at because it is so familiar. The enticement to read the book is the stories it contains. Plus the hope that it will make our juniors more capable of handling complex situations that require meticulous care...

The discussion on the article looks nice but the submitted title breaks the HN rule about numbering (IMO). It's a catchy take on the post anyway. I doubt I would have looked at a more mundane title.

Also sometimes: the bug is not in the code, it's in the data.

A few times I looked for a bug like "something is not happening when it should" or "This is not the expected result", when the issue was with some config file, database records, or thing sent by a server.

For instance, particularly nasty are non-printable characters in text files that you don't see when you open the file.

"simulate the failure" is sometimes useful, actually. Ask yourself "how would I implement this behavior", maybe even do it.

Also: never reason from the absence of a specific log line. The logs can be wrong (bugged) too, sometimes. If you're printf-debugging a problem around a conditional, for instance, log both branches.
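
A tiny Python sketch of the "log both branches" point (names are invented): if only the taken branch logs, a missing line can't tell you whether the condition was false or the code was never reached at all.

    import logging

    logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
    log = logging.getLogger("retry")

    def maybe_retry(attempts: int, max_attempts: int) -> bool:
        if attempts < max_attempts:
            log.debug("retrying: attempts=%d < max=%d", attempts, max_attempts)
            return True
        else:
            log.debug("giving up: attempts=%d >= max=%d", attempts, max_attempts)
            return False

    maybe_retry(3, 3)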

I'm not sure; that doesn't sit well with me.

Rule 1 should be: Reproduce with most minimal setup.

99% you’ll already have found the bug.

The 1% for me was a font that couldn't do a certain combination of letters in a row: "ft" just didn't work, and that's why it made mistakes in the PDF.

No way I could've ever known that if I hadn't reproduced it down to the letter.

Just split code in half till you find what’s the exact part that goes wrong.

  • Related: decrease your iterating time as much as possible. If you can test your fix in 30 seconds vs 5 minutes, you’ll fix it in hours instead of days.

  • Rule 4 is divide and conquer, which is the 'splitting code in half' you reference.

    I'd argue that you can't effectively split something in half unless you first understand the system.

    The book itself really is wonderful - the author is quite approachable and anything but dogmatic.

I also think it is worthwhile stepping through working code with a debugger. The actual control flow reveals what is actually happening and will tell you how to improve the code. It is also a great way to demystify how others' code runs.

  • I agree and have found using a time travel debugger very useful because you can go backwards and forwards to figure out exactly what the code is doing. I made a video of me using our debugger to compare two recordings - one where the program worked and one where a very intermittent bug occurred. This was in code I was completely unfamiliar with so would have been hard for me to figure out without this. The video is pretty rubbish to be honest - I could never work in sales - but if you skip the first few minutes it might give you a flavour of what you can do. (I basically started at the end - where it failed - and worked backwards comparing the good and bad recordings) https://www.youtube.com/watch?v=GyKrDvQ2DdI

    • To go backwards, don't you have to save the previous states of the machine? This always seemed like a strong limitation.

  • I think that fits nicely under rule 1 ("Understand the system"). The rules aren't about tools and methods, they're about core tasks and the reason behind them.

  • That is rule #3: quit thinking and look. Use whatever tool you need and look at what is going on. The next few rules (4-6) are what you need to do while you are doing step #3.

  • Make sure through pure logic that you have correctly identified the Root Cause. Don't fix other probable causes. This is very important.

  • This is necessary sometimes when you’re simply working on an existing code base.

One good timesaver: debug in the easiest environment that you can reproduce the bug in. For instance, if it’s an issue with a website on an iPad, first see if you reproduce in chrome using the responsive tools in web developer. If that doesn’t work, see if it reproduces in desktop safari. Then the iPad simulator, and only then the real hardware. Saves a lot of frustration and time, and each step towards the actual hardware eliminates a whole category of bugs.

> Check the plug

I just spent a whole day trying to figure out what was going on with a radio. Turns out I had tx/rx swapped. When I went to check tx/rx alignment I misread the documentation in the same way as the first. So, I would even add "try switching things anyways" to the list. If you have solid (but wrong) reasoning for why you did something then you won't see the error later even if it's right in front of you.

  • Yes the human brain can really be blind when its a priori assumptions turn out to be wrong.

Over twenty five odd years, I have found the path to a general debugging prowess can best be achieved by doing it. I'd recommend taking the list/buying the book, using https://up-for-grabs.net to find bugs on github/bugzilla, etc. and doing the following:

1. set up the dev environment

2. fork/clone the code

3. create a new branch to make changes and tests

4. use the list to try to find the root cause

5. create a pull request if you think you have fixed the bug

And use Rule 0 from GuB-42: Don't panic

(edited for line breaks)

> Make it fail: Do it again, start at the beginning, stimulate the failure, don't simulate the failure, find the uncontrolled condition that makes it intermittent, record everything and find the signature of intermittent bugs

Unfortunately, I found many times this is actually the most difficult step. I've lost count of how many times our QA reported an intermittent bug in their env, only to never be able to reproduce it again in the lab. Until it hits 1 or 2 customers in the field, but then when we try to take a look at the customer's env, it's gone and we don't know when it could come back again.

Taking the time to speed up my iteration cycles has always been incredibly valuable. It can be really painful because it's not directly contributing to determining/fixing the bug (which can be exacerbated if there is external pressure), but it's always been worth it. Of course, this only applies to instances where it takes ~4+ minutes to run a single 'experiment' (test, startup, etc). I find when I do just try to push through with long-running tests I'll often forget the exact variable I tweaked during the course of the run. Further, these tweaks can be very nuanced and require you to maintain a lot of the larger system in your head.

I’m so bad at #1.

I know it is the best route, I do know the system (maybe I wrote it) and yet time and again I don’t take the time to read what I should… and I make assumptions in hopes of speeding up the process/ fix, and I cost myself time…

> Check that it's really fixed, check that it's really your fix that fixed it, know that it never just goes away by itself

I wish this were true, and maybe it was in 2004, but when you've got noise coming in from the cloud provider and noise coming in from all of your vendors I think it's actually quite likely that you'll see a failure once and never again.

I know I've fixed things for people without asking if they ever noticed it was broken, and I'm sure people are doing that to me also.

> Quit thinking and look (get data first, don't just do complicated repairs based on guessing)

From my experience, this is the single most important part of the process. Once you keep in mind that nothing paranormal ever happens in systems and everything has an explanation, it is your job to find the reason for things, not guess them.

I tell my team: just put your brain aside and start following the flow of events checking the data and eventually you will find where things mismatch.

  • There's a book I love and always talk about called "Stop Guessing: The 9 Behaviors of Great Problem Solvers" by Nat Greene. It's coincidental, I guess, that they both have 9 steps. Some of the steps are similar so I think the two books would be complementary, so I'm going to check out "Debugging" as well.

  • I worked at a place once where the process was "Quit thinking, and have a meeting where everyone speculates about what it might be." "Everyone" included all the nontechnical staff to whom the computer might as well be magic, and all the engineers who were sitting there guessing and as a consequence not at a keyboard looking.

    I don't miss working there.

I've had trouble keeping the audit trail. It can distract from the flow of debugging, and there can be lots of details to it, many of which end up being irrelevant; i.e. all the blind rabbit holes that were not on the maze path to the bug. Unless you're a consultant who needs to account for the hours, or a teller of engaging debugging war stories, the red herrings and blind alleys are not that useful later.

My first rule for debugging debutants:

Don't be too embarrassed to scatter debug log messages in the code. It helps.

My second rule:

Don't forget to remove them when you're done.

  • My rule for a long time has been: any time I add a print or log (except for the first time I am writing some new code with tricky logic, which I try not to do), never delete it. Lower it to the lowest possible debug or trace level, but if it was useful once it will be useful again, even if only to document the flow through the code on full debug.

    The nicest log package I had would always count the number of times a log msg was hit, even if the debug level meant nothing happened. The C preprocessor made this easy; I haven't been able to get a short way to do this counting efficiently in other languages.
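
    A rough Python approximation of that counting behaviour (the helper is invented): count every time a log call site is hit, even when the active level means nothing is emitted.

      import collections
      import logging

      logging.basicConfig(level=logging.WARNING)   # DEBUG output is suppressed
      log = logging.getLogger("app")
      hit_counts = collections.Counter()

      def debug(msg, *args):
          hit_counts[msg] += 1      # counted regardless of the active log level
          log.debug(msg, *args)     # emitted only if DEBUG is enabled

      for attempt in range(5):
          debug("retrying request, attempt=%d", attempt)

      print(hit_counts)             # Counter({'retrying request, attempt=%d': 5})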

One thing I have been doing is to have the program create a directory called "debug" and write lots of different files there with debugging information (only writing files outside of hot loops), then visually inspect those files after the program has exited.

For intermediate representations this is better than printf to stdout
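
A hedged sketch of that pattern (paths and the computation are illustrative): collect intermediate results in memory inside the hot loop, then dump them to a file in the "debug" directory once the loop is done.

    import json
    from pathlib import Path

    DEBUG_DIR = Path("debug")
    DEBUG_DIR.mkdir(exist_ok=True)

    def run(values):
        intermediates = []
        for v in values:                       # hot loop: no file I/O here
            squared = v * v
            intermediates.append({"input": v, "squared": squared})
        # Write once, outside the hot loop, for later visual inspection.
        (DEBUG_DIR / "intermediates.json").write_text(json.dumps(intermediates, indent=2))
        return sum(item["squared"] for item in intermediates)

    print(run(range(10)))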

I had the incredible luck to stumble upon this book early in my career and it helped me tremendously in so many ways. If I could name only one it would be that it helped me get over the sentiment of being helpless in front of a difficult situation. This book brought me to peace with imperfection and me being an artisan of imperfection.

I can't comment further on David A. Wheeler's review, since his words are from 2004 (and everything he said was true), and I can't comment on the book either because I haven't read it yet.

Thank you for introducing me to this book.

One of my favorite rules of debugging is to read the code in plain language. If the words don't make sense somewhere, you have found the problem or part of it.

#7 Check the plug: Question your assumptions, start at the beginning, and test the tool.

I have found that 90% of network problems are bad cables.

That's not an exaggeration. Most IT folks I know, throw out ethernet cables immediately. They don't bother testing them. They just toss 'em in the trash, and break a new one out of the package.

  • I prefer to cut the connectors off with savage vengeance before tossing the faulty cable ;-)

Post: "9 rules of debugging"

Each comment: "..and this is my 10th rule: <insert witty rule>"

Total number of rules when reaching the end of the post: 9 + n + n * m, with n being number of users commenting, m being the number of users not posting but still mentally commenting on the other users' comments.

Rule 11: If you haven't solved it and reach this rule, one of your assertions is incorrect. Start over.

Review was good enough to make me snag the entire book. I'm taking a break from algorithmic content for a bit and this will help. Besides, I've got an OOM bug at work and it will be fun to formalize the steps of troubleshooting it. Thanks, OP!

  • I recommend this book to all Jr. devs. Many feel very overwhelmed by the process. Putting it into nice interesting stories and how to be methodical is a good lesson for everyone.

Nice classic that sticks to timeless principles. The nine rules are practical, with war stories that make them stick. But I agree that "don't panic" should be added.

Wasn't Bryan Cantrill writing a book about debugging? I'd love to read that.

  • I was! (Along with co-author Dave Pacheco.) And I still have the dream that we'll finish it one day: we had written probably a third of it, but then life intervened in various dimensions. And indeed, as part of our preparation to write our book (which we titled The Joy of Debugging), we read Wheeler's Debugging. On the one hand, I think it's great to have anything written about debugging, as it's a subject that has not been treated with the weight that it deserves. But on the other, the "methodology" here is really more of a collection of aphorisms; if folks find it helpful, great -- but I came away from Debugging thinking that the canonical book on debugging has yet to be written.

    Fortunately, my efforts with Dave weren't for naught: as part of testing our own ideas on the subject, I gave a series of presentations from ~2015 to ~2017 that described our thinking. A talk that pulls many of these together is my GOTO Chicago talk in 2017, on debugging production systems.[0] That talk doesn't incorporate all of our thinking, but I think it gets to a lot of it -- and I do think it stands at a bit of contrast to Wheeler's work.

    [0] https://www.youtube.com/watch?v=30jNsCVLpAE

    • It's a great talk! I have stolen your "if you smell smoke, find the source" advice and put it in some of my own talks on the subject.

First time hearing about these 9 rules, but I learned most of them by experience, over many years of resolving or trying to resolve bugs.

The only thing I don't agree with is the book costing US$ 4.291,04 on Amazon

Rule #10 - it’s probably DNS

  • I worked at place in the late 90s where that was true, at least for anything Internet related. We were doing (oh so primitive by today's standards...) Web development and it happened so many times. I'd call downstairs and they'd swear DNS was fine, and then 20 minutes to half an hour later, it would all be mysteriously working again. But only if we called down heh heh.

    On an unrelated note, one of the folks down there explained the DNS setup once and it was like something out of a Stephen King novel. They'd even been told by a recognized industry expert (whose name I sadly can't remember any more) that what they needed to do was impossible, but they still did it. Somehow.

    They really were great folks, they just had that one quirk but after a while I could just chuckle about it.

  • Years ago, my boss thought he was being clever and set our server’s DNS to the root nameservers. We kept getting sporadic timeouts on requests. That took a while to track down… I think I got a pizza out of the deal.

I would almost change 4 into "Binary search".

Wheeler gets close to it by suggesting to locate which side of the bug you're on, but often I find myself doing this recursively until I locate it.

  • Yeah people say use git bisect but that's another dimension (which change introduced the bug).

    Bisecting is just as useful when searching for the layer of application which has the bug (including external libraries, OS, hardware, etc.) or data ranges that trigger the bug. There's just no handy tools like git bisect for that. So this amounts to writing down what you tested and removing the possibilities that you excluded with each test.
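
    A hand-rolled sketch of bisecting over input data in Python (reproduces() is a stand-in for however you actually reproduce the bug): find the smallest prefix of the records that still triggers the failure, then inspect the record at the boundary.

      def reproduces(records) -> bool:
          """Stand-in: run the failing pipeline on this subset and report failure."""
          return any(r < 0 for r in records)   # pretend a negative value triggers the bug

      def smallest_failing_prefix(records):
          lo, hi = 0, len(records)   # invariant: prefix of length hi fails, length lo doesn't
          assert reproduces(records[:hi])
          while hi - lo > 1:
              mid = (lo + hi) // 2
              if reproduces(records[:mid]):
                  hi = mid           # failure already present in the first mid records
              else:
                  lo = mid           # failure needs something beyond the first mid records
          return records[:hi]        # last element is the suspect record

      print(smallest_failing_prefix([3, 7, 2, 9, -4, 5, 1]))   # -> [3, 7, 2, 9, -4]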

Personally, I’d start with divide and conquer. If you’re working on a relevant code base chances are that you can’t learn all the API spec and documentation because it’s just too much.

> Ask for fresh insights (just explaining the problem to a mannequin may help!)

You can’t trust a thing this person says if they’re not recommending a duck.

I love the "if you didn't fix it, it ain't fixed". It's too easy to convince yourself something is fixed when you haven't fully root-caused it. If you don't understand exactly how the thing you're seeing manifested, papering over the cracks will only cause more pain later on.

As someone who has been working on a debugging tool (https://undo.io) for close to two decades now, I totally agree that it's just weird how little attention debugging as a whole gets. I'm somewhat encouraged to see this topic staying near the top of hacker news for as long as it has.

  • > If you didn't fix it, it ain't fixed

    AKA: “Problems that go away by themselves come back by themselves.”

This is related to the classic debugging book with the same title. I first discovered it here in HN.

I’d add “a logging module done today will save you a lot of overtime next year”.

>Rule 1: Understand the system: Read the manual, read everything in depth (...)

Yeah, ain't nobody got time for that. If e.g. debugging a compile issue meant we read the compiler manual, we'd get nothing done...