I've posted this story before, but it fits here rather nicely.
I had a function that looked like this:
void f() {
    bool flag = true;
    while (flag) {
        g();
    }
}
This function would sometimes exit. But that's really all there was to the function. Somehow flag was becoming false, even though nothing ever wrote to it.
So you might think about g() smashing the stack, when a variable is mysteriously changing, but you'd expect the return address to also get written, and it wasn't - the function returned from g() to f(), found flag to be false, exited the loop, and returned from f().
Eventually I got desperate enough to look at the assembly code produced by the compiler, and I became enlightened. (This was g++ on an ARM, by the way.) flag was being stored in R11, not in memory. (Might have been R12 - it's been a while.) When g() was called, f() just pushed the return address. Then g() pushed R11, because it was going to have its own variable to stash there, and then created space for its stack variables. And one of those variables was smashing the stack by 4 bytes, over-writing the saved flag value from f().
Worse, the way the stack was getting smashed was on a call to mesgrecv(). This takes a pointer to a structure and a size, but the relationship between the two isn't what you'd expect. The size isn't the size of the structure, but rather the size of a substructure within that structure. A contractor had gotten that detail wrong when they used that mechanism for IPC between two chips. (They'd gotten it wrong on the sending side, too, so the data stayed in sync.)
The net result was that the flag got cleared when four next-door-but-unrelated bytes on another CPU were all zero. It took me a month, off and on, to figure that out.
It already went away when I tried to print out the address of the variable, so that I could watch it in the debugger (because, in order to take the address of it, it had to become a stack variable).
I love these toughies, especially the full ssh one! A true debugging wizard.
“Dumb” problems can happen to anyone too. I once walked by the desk of $well_known_open_source_developer who was struggling with a mysterious bug. He’d narrowed it down to the specific function and was groveling in the function setup code (what the compiler generates before your code is called). He asked me to take a look and within seconds I saw an uninitialized variable being read.
This is not because he was a bozo! He had decades of experience. It’s simply that sometimes we get slightly wedged and can’t see the thing that is “staring us in the face”. He was embarrassed (so not mentioning his name) but he should not have been. If anything it simply proves that it can happen to anyone.
Related to this: at one organization my debugging skills were (spoiler: undeservedly) legendary...literally word got around until some new hire asked me about it months later.
Why? I came in one morning to find some folks trying to get some new model of terminals to work with the mainframe. Back then you needed the right combo of byte length, stop bits etc. They asked me if I could fix it and I said sure. As one does, I poked at the setting switches and walked off to get my coffee so I could come back and think clearly. By the time I came back all the terminals were in use so I just went on with my day.
Apparently I had randomly toggled the necessary bit. But the way the story was told: I had walked in, agreed to help, rubbed my chin, then simply pushed the right button and walked off without another word. Which in some sense is true, but gave me completely undeserved credit.
When I was a kid (teenager) I worked at an indoor shooting range, mind you, not real guns but just BB guns. I was supervising a bunch of school kids doing some practice shooting at biathlon targets (just 10m however) and one of them had an issue with the gun, with the pellet getting stuck somehow. I had a look at the gun, sorted the pellet out and fired the gun off the hip and hit a bullseye without aiming. Pure luck of course but the kids were like "woaaah" and of course I never told them that it was just luck and not my mad leet shooting skills XD
On the subject of "dumb problems" and undeserved acclamation:
One day I walked into the break room and observed one of the dev team leaders pissing on about how the vending machine didn't have the snack he wanted. I tapped on the glass right in front of him and he was instantly chagrined at the appearance of what he was missing right in front of his face.
This happens often enough that I have some stock humor saved up for the occasion. In a serious tone I told him not to fret, I had observed this was a common problem with good developers who don't take enough breaks and with a little self-examination he could overcome it. In a kidding tone, I told him he was obviously stuck in the trap of working too hard to work smart. "Just check your assumptions with a Pareto graph and a lot of life's little dilemmas will fall apart into easy pieces," I sez with a chuckle.
Later on it turned out he had told his team and they had a chuckle over how I enjoyed kidding them. Then they turned to the problem at hand and my guy had an aha moment and realized one of the grounding assumptions they had since literally the beginning of the project was subtly off. The week before I had solved someone's Java boxing/unboxing problem literally by accident in a single glance at the debugging trace whilst kidding him over his attempt to cast Integer into Long (int promoted to Integer in a library call). The team added up all my bad jokes and occasional "accidental" help, and in a collective Aha decided I was some kind of software engineering guru (I'm not - just a lot of painful painful experience to make light of).
And for a few months thereafter the devs would intently ponder everything I said. That quarter I actually got devs assigned to my issues because they all clamored to rub up against my supposed enlightenment, heheh. After I realized what was going on I got my top ten addressed and then suggested they needed to apply their learning to other people's issues...before they discovered my feet of clay...
Here's the craziest one that actually happened to me.
The company I worked for had installed what's best described as a mini-supercomputer (though we avoided the term) at a site in Boulder. We started getting reports of failures on the internal communication links between the compute nodes ... only at high load, late in the day. Since I was responsible for the software that managed those links, I got sent out. Two days in a row, after trying everything we could to reproduce or debug the problem, I got paged minutes after I'd left (and couldn't get back in) to tell me that it had failed again.
Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well known problem with installations in that area, having caused multi-month delays for some of the larger supercomputer installations in the area. But we'd already corrected for that. It wasn't the problem.
What it ultimately turned out to be was airflow and cooling. The air's thinner up there, so it carries less heat. But it wasn't the processors or links that were getting too hot. It was the power supply. When a power supply gets warmer it gets less efficient. Earlier in the day or with shorter runs as we tried different things this wasn't enough to cause a problem. With it being warmer later in the day, continuous load for longer periods was enough to cause slight brown-outs, and those were making our links flaky. And of course it would always restart just fine because it had cooled down a bit.
The fix ended up being one line in a fan-controller config.
I had a loaner machine (RS-6000 minicomputer) that would have unrecoverable ECC errors when the cover was on. The tech would come and try to diagnose it, but with the cover off, everything would work fine. He'd swap the memory anyway and put the cover back on. Within a few hours the memory bank would be failing again. Turned out the machine had been a loaner in a lab where it had acquired some alpha-emitting goo on the inside of the side panel. The lab had just run it with the side panel off to solve the problem, never noticing the goo, never mentioning it to IBM when they packed it up to ship.
It's a long story but the gist is after multiple board swaps, realizing we'd isolated the panel as the fault, I noticed the goo and on a hunch checked it with a scintillator, deducing it was alpha when cardboard blocked it. Turns out the ultra-precious-metal IBM heat sink on the board had an open path that effectively channeled the alpha particles into one of those multi-chip carrier thingies, which featured exposed chips.
As for why I had a scintillator lounging in my desk at a portfolio management company, don't ask. Let's just note the iconic IT anti-hero of that era was the Bastard Operator From Hell, and leave it at that.
Unrelated to a strange bug story or anything but you just reminded me of when I was also helping someone set up a, as you called it, mini-supercomputer. It was to do quantum simulations. We were setting it up and the researcher who was going to use it made the root user name skynet. Now I know that joke has probably been played out at campuses around the world but it just seems unnecessary to tempt the fates like that.
> Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well known problem with installations in that area, having caused multi-month delays for some of the larger supercomputer installations in the area. But we'd already corrected for that.
Wow, I sense a more interesting story in here. Care to reveal how it was first found out and how common it actually is?
In a nutshell, cosmic rays causing bit-flips really is a thing, and it's more of a thing at higher altitude because of less atmosphere. It's rarely a problem at sea level. At higher altitude you really need to use ECC memory, and do some sort of scrubbing (in Linux it's called Error Detection And Correction or EDAC) to correct single-bit errors before they accumulate and some word somewhere becomes uncorrectable.
The incident that brought this home to a lot of people was at either NCAR or UCAR, both near Boulder. Whichever it was, they were installing a new system - tens of thousands of nodes - and had not been careful about the EDAC settings. Therefore, EDAC wasn't running often enough, and wasn't catching those single-bit errors. Therefore^2, uncorrectable errors were bringing down nodes constantly. According to rumor, this caused a huge delay and almost torched the entire project. It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause especially when you're on the bleeding edge).
"Fail on certain moon phases" reminds me of a C++ bug I encountered while trying to set up the demo for PSIP (Digital TV Guide) destined for NAB in Las Vegas. We had programming schedules resembling excel spreadsheets and my job was just to create a good one for the demo. I would spend all night making one and sent it to my boss and each morning would get in trouble for sending in blank schedules and had no idea why. On one occasion I happened to be editing at 3am and noticed all of my edits rolling back one by one. It was actually viewable on the screen as if someone took control of excel and was rolling back each field. My immediate thought was I really need to get some sleep but later we found the auto-save feature inverted itself after 3am exactly and would go through each delta one by one rolling itself back as it had been edited. The bug was found in the calculation of the vernal equinox which moves from 3am to 9pm to 3pm. Since it was triggering the leap year code 6 hours of time would get rolled back edits and all! This was of course 2008 year of the digital transition from analog cable which happened to also be a leap year.
I can't scan documents when my daughter is asleep. When she is awake, all is fine, but the minute she goes to sleep, and I'd like to use my free time to scan documents and suchlikes, forget it. I could still print documents on the same device though. Here's what I found:
The printer-scanner was connected to wi-fi. The wi-fi router was in my daughter's room, as that is where the cable socket was, tucked just behind a bench in her room. It was also near that bench that her baby monitor camera was standing. It wasn't wi-fi connected, but for whatever reason it interfered with the wi-fi signal. Same with the receiver, if I put it near my laptop, the wi-fi connection would die.
The monitor was off most of the time, and on precisely when my daughter was asleep.
As for why I could still print, just not scan: presumably that's something to do with the bandwidth, I'm guessing it took more wi-fi bandwidth to send a scanned image than to print a document (I never printed pictures on this printer).
Yeah, baby monitors are THE worst behaving radio devices. They would be second after malfunctioning neon sign transformers, but transformers are not intended as radio devices.
As for why the scanner and not the printer was losing the connection: probably the device has a small buffer, and the scanner doesn't pause when that send buffer fills up because of retransmissions, but stops completely. The printer can probably wait for more data.
Still... my thinking is, I'd rather a crappy baby monitor than a badly-secured device connected to wifi, and by extension, the whole world.
We do actually have a spare camera we sometimes take away, that works on Wi-Fi. I accidentally left it once at my parents' place, then travelled back home overseas. I got a notification of my baby crying desolately as I got off the plane. Turns out it was my family arguing about something in the room where my daughter had been staying, and the camera somehow switched itself back on. It was freaky!
The main reason 2.4GHz band is unlicensed (anyone can broadcast on it without e.g. a ham license) is because it's garbage. Baby monitors and microwave ovens are two of the most common offenders for dumping large amounts of noise into that band. If you can move the printer to a 5GHz band, that could help a lot.
And in the near future, wifi 6E will add a ton of spectrum, allowing devices to just avoid noisy channels. But to use that you'll have to upgrade the printer or add a wifi 6E bridge.
Had a Customer who complained that the Wi-Fi in one of their conference rooms was "always" unreliable. I checked it multiple times and didn't find any problems-- low SNR, strong signal, low airtime utilization.
Eventually I found out that "always" meant a particular recurring lunchtime meeting, scheduled right when a steady stream of workers were going into the break room across the hall and heating food in a microwave oven.
I had a Panasonic 2.4GHz jamming device. Could take out channels 1-11 all at the same time. It wasn't supposed to be a jamming device, it was supposed to be a cordless phone. Set up wifi for a customer, and it worked fine when I was on site. Later the customer complained it was completely unusable, in what they thought was a random fashion.
Turned out whenever they used the cordless it made so much noise on the 2.4GHz band that wifi wouldn't work. Phone itself worked fine. Had them get new phones and the problem went away.
You really want to think about moving the Wi-fi out of the baby's room. Get a GQ-390 meter and you will see the torrent of dangerous radiation flooding her room, above recommended levels.
I was able to get my Internet provider to relocate the device to my basement.
My favourite crazy bug was during a university course on autonomous robotics. One of the other groups was using a metal castor at the back of the robot along with 2 driven wheels. After a little while their robot would completely crash and stop responding.
I'd previously encountered a similar issue which was due to the library code we'd been given which opened a new /dev/i2c file for each motor command, eventually exceeding the max file handles and killing the program. So I assumed it was something sensible like that.
Some time later they got all excited and called us over to explain the real reason it was crashing. Their robot would initially work fine for a reasonable period of time. Then when the robot drove over the metallic tape on the floor of the arena it would die. The robot must have been building up a static charge while moving around which would eventually be dissipated when the metal touched the tape.
I wouldn't have believed it had they not set up two tests, one outside the normal arena and one inside. Changing the metal castor for a bit of lego fixed the problem.
"Surprisingly, we have also seen this issue connected to gas lift office chairs. When people stand or sit on gas lift chairs, they can generate an EMI spike which is picked up on the video cables, causing a loss of sync. If you have users complaining about displays randomly flickering it could actually be connected to people sitting on gas lift chairs. Again swapping video cables, especially for ones with magnetic ferrite ring on the cable, can eliminate this problem. There is even a white paper about this issue."
Risks Digest was on my daily reading list for years. I've been in computing since the early 70s, the history of computing is fabulous background to inform e.g. debugging, or SRE.
I've seen this one on twitter https://twitter.com/royvanrijn/status/1214162400666103808?la... There's probably some civilizational complexity limit where the unexpected interactions between seemingly isolated pieces of tech become so bad that we cannot introduce anything new without introducing legion of weird bugs.
>There's probably some civilizational complexity limit where the unexpected interactions between seemingly isolated pieces of tech become so bad that we cannot introduce anything new without introducing legion of weird bugs.
Come to think of it, I believe that meme was wandering about on Usenet from before the early days. I think Vernor Vinge alluded to it in one of his novels, (paraphrasing here) some interplanetary civilization crashing because in-transit space traffic was so dense no new launches could occur in a useful time frame due to safety lock-outs, and they didn't want to accept the risk in changing the safety margins...
One time I was writing some code in C. I found a bug, the solution seemed pretty obvious, so I fixed it, recompiled the code, and ran it again. The bug was still there.
I took a look at the rest of the code in case that I missed something. I couldn't find anything, so I added a few print statements and recompiled. I ran the code and nothing came up.
Interesting, apparently the code is not executing the branches it should. I verified the input data and code. It didn't make sense, there had to be some serious bug there that I didn't consider. I added a bunch more prints.
Recompile and execute. Still nothing. Wait a minute, THAT doesn't look good. I added a print statement right at the entry point of the program. Nothing.
At this point the root problem became apparent; my changes just weren't getting compiled. Phew, problem solved! I cleaned all the cached files and recompiled the source code. Those print statements still weren't coming up.
In the end I had to move my source code to another machine and compile it there to get it working. I suspect some global variables or path trickery were involved, but to this day I still haven't got a clue what was wrong, nor have I seen it happen again.
Ahhahahaha. I have a similar story to that, but I eventually realized what happened.
I forget which command it was exactly, but I rsync'd or something to get a new code directory, and with the backup options in use the directory I was in got renamed.
But I still had command prompts open in that directory. And all of the files were there. So I didn't realize that one directory was not equal to the others even though it had the same name (it was a subdirectory) and appeared to have the same files.
One similar story: I was maintaining some C++ code that had a few #ifdefs in it. Someone reported a problem.
I put a breakpoint on the calling code and traced into my code. It went into the #ifdef code I expected, but the problem persisted.
Just to double-check, I let the program run until it hit the breakpoint again and traced in, but this time, it went into the #else code! That code should have been removed by the preprocessor, yet here I was, currently stepping through it.
After questioning my understanding of the C preprocessor (and my own sanity), I luckily noticed that the module name in the debugger was very slightly different in the two cases mentioned above.
The world finally made sense again. My code was in a header file that was compiled (with different symbols defined) into two different modules and both of those modules were loaded into the same process. When I set the breakpoint in the debugger, it silently set breakpoints in both modules.
Obvious in retrospect, but very surprising to my inexperienced past self:
I'd been working on some C code for an hour or two. It wasn't behaving how I expected it to (and at the time I knew nothing of debuggers), so I added a print statement and recompiled. I got a compilation error: something like "Syntax error on line 123: #incl5de <stdio.h>". Shocked, I scrolled to that line in my text editor to fix the typo, but it wasn't there. I compiled the same code again and there were no errors.
Turns out there wasn't a bug in my code! I immediately shut down my computer because my RAM was going bad. To this day, what surprises me most is that my computer was able to successfully boot and behave normally for an hour or two, even though random bits were apparently being flipped.
The RAM going bad in my PC was one of the most annoying issues I had to troubleshoot. It started with my Firefox pages randomly crashing, first occasionally and then several times a day.
I then started getting errors when playing games with obscure error codes - which yielded nothing when searching them up.
I eventually found a comment in a thread about the crashing game that the RAM could be bad. I ran some diagnostic tests, and with the number of errors that came up I was surprised my computer worked at all.
> To this day, what surprises me most is that my computer was able to successfully boot and behave normally for an hour or two, even though random bits were apparently being flipped.
With my current computer I overclocked the RAM to the best config I could get memtest to run without errors over the night. The RAM also has ECC and there were no problems reported during normal operation, even when re-compiling (most) packages. But when I got to compiling LLVM the system would crash shortly after logging ECC errors. Backing off the overclock a bit fixed that. So the rate of memory errors can definitely depend on what you are running.
> To this day, what surprises me most is that my computer was able to successfully boot and behave normally for an hour or two, even though random bits were apparently being flipped.
Yup. I once had a machine that would freeze up when I ran package updates. I thought (of course), that the package manager had a bug. Turned out, running the upgrade was the only thing memory-intensive enough to use the faulty memory that I'd installed. After all, on a light system you can totally boot and run in <1GB of RAM...
I remember reading one years ago where someone had a problem installing new software on some embedded device - whatever they did it came up "checksum is bad".
After much testing they eventually realised that the checksum literally was the hex "bad".
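One guess at how such a message can arise, as a minimal Python sketch. The message template and the value 0xBAD are assumptions chosen to match the story; nothing here comes from the actual device.

```python
# Hypothetical installer log line: the checksum is printed as lowercase hex.
checksum = 0xBAD  # assumed value, picked so the hex digits spell a word

message = f"checksum is {checksum:x}"
print(message)
```

Printing the value with an explicit prefix (for example `{checksum:#06x}`) would have made it obvious that the word was a number and not a verdict.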
I have two favourite bugs, one weird and one dumb.
Weirdest one was an IDE where the colorizer gave up on source lines longer than 998 chars. Instead it rendered the whole line as background, i.e. invisible. I once wasted two hours debugging a program with an invisible line of code!
The dumbest was a postage billing system for a bank using a third party Print-and-Mail company. Somehow the billing system went live adding the previous day's total postage costs to itself, then adding the new day's postage. These exponentially growing totals were then paid automatically by the accounting system each night... So it goes live, ... and a week later Finance gets an alert the account is overdrawn... They actually paid out nearly $1b in postage costs before hitting their internal credit limit with the bank's treasury.
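To see why a week was enough, here is a rough Python sketch of one reading of that bug. The function name, the flat daily cost of 100 and the exact update rule are all assumptions; the story only says the previous total was added to itself before the new day's postage.

```python
def nightly_totals(daily_postage, buggy):
    """Return the running total that would be billed after each day."""
    total = 0
    totals = []
    for cost in daily_postage:
        if buggy:
            # Assumed bug: the previous total is added to itself first,
            # so the total roughly doubles every night.
            total = total + total + cost
        else:
            # Intended behaviour: just accumulate the new day's postage.
            total = total + cost
        totals.append(total)
    return totals


week = [100] * 7  # hypothetical: 100 units of postage every day
correct_totals = nightly_totals(week, buggy=False)
buggy_totals = nightly_totals(week, buggy=True)
print(correct_totals)
print(buggy_totals)
```

With a constant daily cost c, the buggy total after n days works out to c * (2**n - 1), which is how a modest postage bill can reach an enormous figure within days.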
My most recent bizarre bug: a coworker came to me with a bug where no matter what he tried, he could not get an if some_var is null to be true. The debugger would show the value was null. The logger showed the value was null, but the if statement would not work. After a morning of trying to fix it, he asked if I would take a look. I told him to put the null in quotes in the if. It worked. Turned out a JavaScript library had a bug where it would use the string "null" instead of null.
There's a popular progressive API I've been forced to work with that uses "true" and "false" instead of their respective booleans. The most egregious that I've ever seen in this class of errors was an API (from Google!) that returned " false".
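The stringly-typed values in the two comments above can be handled defensively at the boundary. This is a Python sketch, not the JavaScript from the story, and the helper names `is_null` and `to_bool` are made up for illustration.

```python
def is_null(value):
    """True for a real null (None) and for the string "null" a buggy library might send."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() == "null"


def to_bool(value):
    """Accept real booleans plus "true"/"false" strings, tolerating stray whitespace."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text == "true":
            return True
        if text == "false":
            return False
    raise ValueError(f"not a boolean: {value!r}")


# The pitfall: any non-empty string is truthy, including " false".
naive = bool(" false")
parsed = to_bool(" false")
print(naive, parsed)
```

Printing values with `repr()` (note the `!r` in the error message) shows the quotes and the leading space, which a plain log line tends to hide.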
My favorite was when I was working at SGI after it had taken over Cray Research. I was one of the lowly Cray guys in Wisconsin working with the wunderkinds in California. I was to run the regression tests on the chip being designed in California using some software that they had provided. I would run the tests, but some days they would crash in the middle of the night. Then the California guys would be angry that they got no test results. I started debugging the code and got to a program called lswalk that would dole out jobs to the dozens of servers to be 'run'. The code was written by a hot shot young MIT graduate, but I was sure that the problem was with this code. I got the source code and started looking for problems, and one thing I found was that if one of the servers responded with an error the code would print out an error message. One problem though... The error message was printed from an uninitialized string, so the printf routine would search for an end of string that was never there, probably overwriting buffers and crashing code all over the place. So one lesson is that even the best and brightest make mistakes.
Sometimes I wonder how we accomplished anything in those days with software that had so many trap doors beneath it.
The very first time my company farmed me out to work onsite with a client. Day 1, Job 1: download the client's website code and get it running on my laptop.
... It just wouldn't work. Everything I tried - failed. Everything else on my laptop was working fine, except this code. Everyone else who had ever downloaded the code had managed to get it working on their machines within a couple of hours. Colleagues working onsite with me tried to help, but everything they tried - failed. Finally the decision was taken to reset my laptop to factory defaults and reinstall everything. That took up half of Day 2. Tried to get the client's site running - failed! Things were beginning to get really embarrassing - all this was happening in full view of the clients. In desperation, my company called me back to their offices and issued me with a new laptop. Back onsite, the code downloaded ... and worked first time!
Turned out that the issue was that my hard drive filesystem had been set up (not by me!) as case-sensitive, and the client code included a file with an all-caps filename, which the code called using a lowercase string. Almost lost my job over that one.
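A check along these lines could have flagged the mismatch without a second laptop. It is a minimal Python sketch; the function name and the sample file names are hypothetical, and it compares plain name lists and does not touch a real file system.

```python
def case_mismatches(files_on_disk, referenced_names):
    """Return (referenced, actual) pairs that match only when case is ignored."""
    exact = set(files_on_disk)
    by_lower = {name.lower(): name for name in files_on_disk}
    mismatches = []
    for name in referenced_names:
        # An exact hit works everywhere; a case-only hit works only on
        # case-insensitive file systems.
        if name not in exact and name.lower() in by_lower:
            mismatches.append((name, by_lower[name.lower()]))
    return mismatches


files_on_disk = ["README.TXT", "index.html"]     # hypothetical repository contents
referenced_names = ["readme.txt", "index.html"]  # hypothetical names used in the code
found = case_mismatches(files_on_disk, referenced_names)
print(found)
```

Any pair it reports is a reference that would load on a case-insensitive volume and fail on a case-sensitive one.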
My weirdest that I can recall right now was a PDF file that would not print. Since the printer was typically unhelpful with the error message, as was support (this is a room-sized commercial printer but we didn't get the help I'd really expect), I had to dive into it myself.
Long story short, whatever had produced the PDF had also embedded a TrueType font where one character was named //something. This is fine. The character just has a weird name, but it works. It's technically up to spec AFAIK, and I got it out of the PDF with ttfdump to have a look at it.
Well the printer's internal RIP, unknown to us, converted the PDF to Postscript when rasterising. And //something is called an "immediately evaluated name" which I forget the details for, but basically this font character, interpreted as postscript, was causing a lookup for a named variable which did not exist. Hence the crash.
I had a similar one where Adobe InDesign had been used to make a PDF where someone had selected the words to change the font, but not the spaces between (or perhaps they did, and it was a bug). This meant that the PDF included a subset font that only included the space character. Since the space character is not drawn, this resulted in a 0 byte long glyf table. Based on my reading of the TrueType font spec at the time, this isn't really proper.
Printer didn't like that one bit and died as it does to anything that smells slightly wrong. Adobe said it was fine and up to spec though, apparently 'TrueType' has a different meaning inside a PDF :)
My favorite out of these is the 500 miles email limitation one. I work mostly on big bulky manufacturing equipment but my job is to abstract out the computing part. This story reminds me that every time I want to do something I am still limited by physics. I am reminded of this story whenever the hardware people ask me to insert an artificial delay in computation.
Yes, the old limits-of-the-time style bugs are great. Some arbitrary limit, or a variable sized to hold a value deemed way more than enough at the time, jumps out years or decades later and catches you out. Y2K was one of the well-known ones, but there have been many bugs of that type.
I love programmer stories like this. My favourite personal experience was on my first Ruby on Rails project after first moving to London. I was pretty green at the time, having had only a few years of PHP experience under my belt and little else.
We had to build a Rails app around a poker game. We didn't own the source to the poker game or its API, but we had to embed it nonetheless. We had this really strange issue where some people, under a certain circumstance, couldn't get into the game. It would just boot them out. Me and my team mate must have pored over the Ruby code dozens and dozens of times and found no evidence of this bug, no ability to reproduce it; bearing in mind I was still learning the ropes, and jumping head first into an unfamiliar codebase is quite daunting.
Eventually I decide to get my hands dirty and I start poking into this game engine. We embedded it as a Flash widget, but the server doing most of the work was written in a mix of C++ and Python. I didn't fully comprehend what I was looking at but, even though things looked suspicious, I couldn't put my finger on an actual problem until I looked at the API written in Flask and noticed that one line of code didn't look like any other.
some_value = params['some_key']
If the request didn't contain the parameter `some_key` then this would raise a KeyError.
After maybe three solid weeks of trying to debug this thing, I submitted a one line patch:
some_value = params.get('some_key')
It's not quite as weird or as fun as most examples but for me personally it was such a great lesson in debugging and being curious about unfamiliar stuff, rather than closed off or afraid.
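The difference between the two lookups in that patch, as a small runnable Python sketch. A plain dict stands in for the request parameters here, which is an assumption; the real code used whatever mapping the Flask app exposed.

```python
# Stand-in for the request parameters; "some_key" is deliberately missing.
params = {"other_key": "42"}

# Indexing raises KeyError when the key is absent.
try:
    some_value = params["some_key"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .get() returns None when the key is absent and does not raise...
some_value = params.get("some_key")

# ...or a caller-supplied default.
fallback = params.get("some_key", "default")

print(lookup_failed, some_value, fallback)
```

The trade-off is that `.get()` moves the problem downstream: the calling code now has to cope with `None`, so the patch is only safe if a missing value is handled after the lookup.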
We were using mSQL in the 90s for web projects. A very important customer wanted a "real" database so we bought DB2. Because we didn't have an IBM platform or Solaris we went with Windows NT.
Everything went fine, until one day we noticed the website being slow. Investigating pointed to the database as the culprit. So I went to the data center, logged into the NT box and checked DB2. Everything was fast. Back at my desk, the database was slow again after some time. Back to the NT server and the same thing happened.
After quite a long time I found the real culprit: the NT pipes screen blanker, which renders OpenGL in software. After some time without interaction the screen blanker started up and took all the CPU, so the database and the website went slow. Someone had set the screen blanker to the nice GL pipes renderer.
[Searching the web, IBM introduced DB2 for Windows NT 31.10.1995 and I went to Cebit that year to check it out]
This reminds me of a nice day we spent at a customer's premises trying to figure out why DB2 wouldn't install or start properly on a win2k box. Weird error messages etc. Problem was that it didn't like that the box was named 'DB2'...
We're actually working on a collection of such stories internal to our division. We've found that these tales are a great way of helping people understand the complexities and quirks of our nearly 3 decade old code base.
I think story telling is an underrated technique in our profession.
In all projects there are coding rules (like "make destructors noexcept"). A rule sticks much better if you also tell a story about some debugging caused by not following the rule.
I once worked at a place that had a style guide for the main front-end language, with links to terrible things that had happened as a result of breaking the style guide's rules.
It was surprisingly effective, so I completely agree with you on the story-telling.
First month or so at my new employer, big consultancy firm for a financial institution.
Had a fairly complex distributed monolithic application integrated with Tibco EMS, Oracle DB and distributed XA transactions.
Regularly, but randomly, in production, after receiving a good amount of messages in the input queues (which then got rerouted to other event queues for parallel processing), some DB transactions simply got stuck. Not rolled back, but stuck in limbo -- after a while the DB simply refused new transactions because so many were stuck.
Nobody had a clue why that was happening. It meant regular manual restarts of the services and re-feeding of the failing messages.
Users started to get fed up and the project threatened to fail.
Got into it. After a couple of weeks of investigation and trial and error with all possible weird flags, it turned out that our version of Tibco EMS had a weird behavior with distributed transactions when the queues got full of messages (queues had a 50MB size limit).
Instead of gracefully rolling back the JMS+JDBC XA transaction, it...kinda exited with an IO error.
Turned out that newer versions of Tibco EMS fixed that issue, but there was no way to get ops to install the new version.
Since upgrading was out of the question, the actual fix was to enable message compression to limit the size of the messages coming into the queues. Turned out that the XML messages we sent there were up to 1.5MB (!)
After discovering that, became basically a war hero and respected by the client as the "savior of the project". Good times.
Your compression workaround reminded me of an issue I ran into a while back.
My team at work uses a reporting tool for vulnerability assessments and pen-tests; basically you can import a bunch of data files, review it in the web app, and generate a report.
I would run into cases where I couldn't upload one of my data files. The web app is JS-heavy, lots of things going on in the background without much visible feedback. It turns out that the programmers had implemented the upload as this async task with a hard-coded timeout for completion, and they likely wrote it while they had great network speed.
I'm on DSL, and generally, it gets the job done. However, upload speed is only 1Mbit/s, so with a big file, my upload would time out. It's hard-coded remember, so it didn't matter that it was still functioning when it got clobbered.
It occurred to me that some file formats, like WAR or Office documents, are basically Zip archives under the hood, so I put my large XML file into one, and tried that.... and it worked! Something on the back-end quietly unzipped my upload and imported the file it contained.
Funnier is that when I mentioned it to the devs, this behaviour was not something they expected. Probably built into a library they use.
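To put rough numbers on why the zip trick helps, here is a small sketch. The XML payload is invented and far more repetitive than a real scan export, so the compression ratio is only illustrative; the 1 Mbit/s figure is the upload speed from the story, and protocol overhead is ignored.

```python
import io
import zipfile


def upload_seconds(size_bytes, bits_per_second=1_000_000):
    """Idealised transfer time, ignoring protocol overhead."""
    return size_bytes * 8 / bits_per_second


def zip_bytes(name, payload):
    """Wrap a payload in an in-memory zip archive using DEFLATE."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


# Hypothetical, highly repetitive XML standing in for a scan export.
record = b"<finding><host>10.0.0.1</host><severity>low</severity></finding>"
xml = b"<findings>" + record * 20000 + b"</findings>"

zipped = zip_bytes("report.xml", xml)

print(f"raw:    {len(xml)} bytes, ~{upload_seconds(len(xml)):.1f} s at 1 Mbit/s")
print(f"zipped: {len(zipped)} bytes, ~{upload_seconds(len(zipped)):.1f} s at 1 Mbit/s")
```

If the hard-coded timeout sits somewhere between those two durations, the zipped upload finishes in time and the raw one never does.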
This one happened last night. A student contacted me because her Anaconda Jupyter notebook (installed on her laptop) just wouldn't connect to the Python kernel. (The notebook itself would load, though, meaning that the server was running fine. It's just that the kernel and its websocket were failing.) I should point out that, because of COVID-19, this troubleshooting was over Zoom, which complicated the diagnosis a bit.
She had not been using Jupyter for several months, as we have been writing stand-alone programs in class using Spyder (the editor that comes with Anaconda), and the command line, and Jupyter had worked the last time that she tried it.
We restarted everything, and still the problem was there. I helped her to update everything, but that didn't solve the problem.
Finally, I looked at the error messages in the console where the Jupyter server is running. It had a huge list of errors, all relating to the pickle library.
We had done an exercise with pickle in the class, but nobody had reported a similar problem. When we looked in her classwork directory, though, we saw that she had created a "pickle.py" file when she was testing something with pickle. But, at that point in the class, we were working in the command line, and everything (including Spyder) still worked just fine.
Evidently, this was the cause of Jupyter's problem. When trying to start the Python kernel in Jupyter, it imported pickle, and evidently it imported her test file rather than the actual library. The fix was simple: we renamed her test file, and everything worked perfectly.
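A quick check for this kind of shadowing could look like the sketch below. It lists `.py` files in a directory whose names collide with standard-library modules. `find_shadowing_files` is a made-up helper, `sys.stdlib_module_names` only exists on Python 3.10 and later, and the fallback set is a tiny illustrative sample, not a complete list.

```python
import sys
from pathlib import Path

# Python 3.10+ exposes the stdlib module names; older versions fall back to
# a small hand-picked sample (an assumption for illustration only).
_FALLBACK_NAMES = frozenset({"pickle", "json", "random", "email", "test"})
STDLIB_NAMES = getattr(sys, "stdlib_module_names", _FALLBACK_NAMES)


def find_shadowing_files(directory):
    """Return names of .py files that would shadow a standard-library module."""
    return sorted(
        path.name
        for path in Path(directory).glob("*.py")
        if path.stem in STDLIB_NAMES
    )


if __name__ == "__main__":
    for name in find_shadowing_files("."):
        print(f"warning: {name} shadows a standard-library module")
```

Printing `pickle.__file__` from inside the failing environment is an even quicker way to see which file actually got imported.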
Had a flaky unit test that would randomly fail with some random Chinese character in the output.
The test was running a log parsing tool against a temporary file that had a pseudo-SQL syntax where you could “select ... from c:\Users\...\temp\abcd1234.xyz\testdata.dat”. The temporary directory was a randomly generated name so that the folder was guaranteed to be empty before every execution of the test.
The test failed on the rare occasion that the randomly generated temp dir name started with the letter ‘u’ followed by four characters that were valid hex digits. When this happened the dir name combined with the backslash before it and became a Unicode escape sequence. It was easy to fix but that test was flaky for months before anyone worked out why.
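A minimal sketch of the mechanism, assuming the query parser expanded `\uXXXX` escapes roughly like the hypothetical function below (the real tool's parser is not shown in the story, and the path components here are invented):

```python
import re

# Matches a backslash, a lowercase 'u', and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def expand_unicode_escapes(text):
    """Replace each \\uXXXX sequence with the character it encodes."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Typical random directory name: no escape sequence is formed.
safe = r"c:\Users\me\temp\abcd1234.xyz\testdata.dat"

# Unlucky name: starts with 'u' plus four hex digits, so "\u4e2d" is an escape.
unlucky = r"c:\Users\me\temp\u4e2d1234.xyz\testdata.dat"

print(expand_unicode_escapes(safe))
print(expand_unicode_escapes(unlucky))
```

The second path comes out with a CJK character where the separator and the first five characters of the directory name used to be, so the file is no longer found and the stray character shows up in the output.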
A bank I did some contracting for had a problem where their Token-Ring network would crash at random intervals during the day in one of their branches. It would also crash at night, but the times when it would happen were more predictable.
And that was the clue they needed to solve the problem - it turned out that the wiring installers had run the cable up the elevator shaft. When the elevator stopped at a certain floor the door motor was sometimes interfering with the signal. The more-regular nightly disruptions were because of the security guard making his rounds.
It turned out that run was pretty close to the length limit for 16 Mbps Token-Ring, so they added a repeater in the middle to boost the signal strength.
I eventually gave up without finding the issue, but somewhere deep inside one version of the Sphinx full text search software was a bug that would sometimes swap which query got which result set. It only happened sometimes when queries were within a few seconds of each other, but it wouldn't happen with only one front end process even in multithreaded mode and would disappear if requests were _too_ close together. If I'd found a way to reproduce it I'd have submitted it to the Sphinx team, but after a few days of potentially private info leaking I gave up and moved to PostgreSQL's FTS.
Similar to the train being stopped by a toilet flush, in the 90's, I worked with devices based on Microsoft's Pocket PC OS. These were equipped with wireless radios. The transmission of a packet caused some interference that the device interpreted as a click on the screen. The cursor was over the [X] to close the application window, so the application would just quit, looking like it crashed.
IBM Java 1.1.8 that was embedded into Lotus Notes 5 (if I remember correctly) didn't have 29th of April if it happened to be on Tuesday.
When you constructed a Date object of 29th of April with such a year that it was a Tuesday, you get the 30th of April when you read back the value. Took a while to figure out why date calculations were sometimes off. The flux of expletives was impressive when we finally did...
No idea, it was years ago. JVMs were extremely buggy. We were quite n00bs at that time also so we didn't dig into the depths of it, we had to fix a production issue that we had already spent a lot of time on. We found the bug by just narrowing down a broken case further and further. Finally we wrote some code that just tried a range of dates one by one over several years to find the pattern. Unfortunately it wasn't possible to upgrade the JVM as it was embedded into Lotus Notes, so we wrote our own date implementation (yay!) that satisfied our needs. It was fixed in a later Notes version, but our sh*tty date implementation lived much much longer...
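The brute-force scan described above can be sketched like this. `buggy_make_date` is a stand-in that only emulates the reported symptom (29 April on a Tuesday comes back as 30 April); it is not how the old JVM worked internally, and the date range is arbitrary.

```python
from datetime import date, timedelta


def buggy_make_date(year, month, day):
    """Stand-in for the faulty runtime: emulates the reported symptom only."""
    d = date(year, month, day)
    if d.month == 4 and d.day == 29 and d.weekday() == 1:  # weekday 1 is Tuesday
        return d + timedelta(days=1)
    return d


def find_broken_dates(make_date, start, end):
    """Construct every date in [start, end] and collect those that don't round-trip."""
    broken = []
    current = start
    while current <= end:
        result = make_date(current.year, current.month, current.day)
        if result != current:
            broken.append(current)
        current += timedelta(days=1)
    return broken


broken = find_broken_dates(buggy_make_date, date(1995, 1, 1), date(2005, 12, 31))
for d in broken:
    print(d.isoformat(), d.strftime("%A"))
```

Printing the weekday next to each failing date is what makes the pattern jump out.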
Not software, but along the lines of cool engineering stories one of my favorites is this one about fixing 230 kV, many-hundred-amp, 10 mile long coax cable in Southern California.
I just had a weird bug in a programming competition.
You basically had to sort the English letters that occur in a text according to their frequency descending. Except the one letter that occurs the least, needs to be sorted as if it occurs the most.
The expected output of the sample case was TPFOXLUSHB
I ran my program on the sample and the output looked correct; then I submitted it, and the judge said it failed the sample case.
In fact, it was printing ͲPFOXLUSHB
That nearly looks like the correct output.
I had confused two variables and it was printing the frequency count as a code point rather than the letter. Quite a coincidence that it looked the same.
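A small sketch of the task and of the mix-up, under two assumptions the problem statement above doesn't settle: ties are broken alphabetically, and only A-Z count as letters.

```python
from collections import Counter


def rank_letters(text):
    """Letters by descending frequency, with the least frequent moved to the front."""
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    # Assumption: ties are broken alphabetically.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    if len(ordered) > 1:
        ordered.insert(0, ordered.pop())  # least frequent letter goes first
    return "".join(ordered), counts


ranking, counts = rank_letters("aaabbc")

first_letter = ranking[0]                  # what should be printed
buggy_first = chr(counts[first_letter])    # the mix-up: the count used as a code point
```

With a count in the hundreds, `chr(count)` lands among Greek and other non-Latin letters, which is how a lookalike of T can appear.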
I used to work as a senior technical support for BEA Weblogic and had all sorts of crazy situations to debug remotely. Including one time when I had to get a person to edit a config file in Vim (which they never used before), on Unix (which they never used before), with me guiding them by phone (no visuals).
And if somebody really understands network and multicast, I would love to know whether I actually nailed the problem or just made it go away accidentally. I have no problems with being wrong, especially this much later :-)
I find that most of these "hard problems" come down to something small: something almost unnoticeable, so it doesn't break immediately and things kinda work.
For me it was a 21-day bug in a broadcast video encoder: some internal frame counter was declared as int instead of int64, so it would reset every 21 days. Fun to debug it was not!
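The wrap-around arithmetic is easy to sketch. The counter's tick rate isn't given above, so the code only relates the counter width, a tick rate and the time to overflow; the 31 bits assume a signed 32-bit int counting up from zero.

```python
SECONDS_PER_DAY = 86_400


def days_until_wrap(value_bits, ticks_per_second):
    """Days until a counter with value_bits usable bits overflows."""
    return (2 ** value_bits) / ticks_per_second / SECONDS_PER_DAY


def implied_tick_rate(value_bits, days):
    """Tick rate that would make such a counter overflow after the given days."""
    return (2 ** value_bits) / (days * SECONDS_PER_DAY)


# Assumption: signed 32-bit int, so 31 usable bits before it goes negative.
rate_for_21_days = implied_tick_rate(31, 21)

print(f"implied rate: {rate_for_21_days:.1f} ticks/s")
print(f"1 kHz counter: {days_until_wrap(31, 1000):.2f} days")
print(f"same 21-day rate, 63 bits: {days_until_wrap(63, rate_for_21_days):.3g} days")
```

The last line is the whole case for int64: at the same rate, the overflow moves from weeks to billions of days.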
It is good to pore over these legendary tales. They come in handy when we need to break out of the moment and try something outrageous to solve the problem.
Crash cows: “There were often significant food shortages in the Soviet Union, and the government plan was to mix the meat from Chernobyl-area cattle with the uncontaminated meat from the rest of the country. This would lower the average radiation levels of the meat without wasting valuable resources.”
There is some sense to that: low levels of radiation are not a cancer risk last I read - everything we eat is slightly radioactive. That said, I can’t think how significantly radioactive cows could be “diluted“ enough.
That describes a real machine, the LGP-30, which really had absolutely no business being as powerful for the price and era as it was. It used an oscilloscope for its (tiny) debug display: the operator’d read the voltage waveform directly as binary.
So at my previous employer we still had an old frontend running an atrociously old version of EmberJS. I run the build but it suddenly fails. We were on a three weekly release schedule, so I figure it must have happened somewhere in the past three weeks, about 400 commits. So I start Git Bisecting, teaching it to a handful of my more junior colleagues as I go along. It took us way too long to figure out that the original build also failed, however.
So my teammates wish me good luck, and I go off on a journey debugging what the issue actually is. As it turns out, the horribly old Ember.JS CLI package version we're using is a version called '0.2.0-beta'. That did not bode well. This frontend of course did not use the nice yarn dependency pinning, just a regular old package.json file, so I go tracing the error into the dependencies.
Eventually I trace the thing to a dependency nested three layers deep or so. A library added a deprecation warning when being used. That in itself is not so bad, but it did that using an injected logging framework from the package using that library. Except that wasn't introduced until a way later version. Of course this tiny little addition could never cause any breakage, so this was released as a semver bugfix release.
The commit time shows it was 23:00 local time (https://github.com/goldenice/ember-cli-babel/commit/c4c95d6f...) when I figured the problem out and committed a fix. So I submit a PR to the library, figuring that if the author happens to be awake I won't have to figure out a way to pin the dependency to an earlier version (which would have been easy in a regular dependency, but this was a global dependency, where it's not as trivial as switching from npm 2.x or 3.x to yarn)
The author thankfully responds almost immediately, asking me to provide a fallback to console.warn instead of skipping the deprecation entirely. Makes sense, so I update my PR, submit it within a few minutes, and I see that the author immediately publishes a new version. Finally something works out for me. Or so I thought.
As it turns out the author made a tiny stylistic fix in my code. Except that tiny stylistic fix butchers my carefully crafted if statement, and now the code is broken again. It took me a while to figure out that the new version of the dependency WAS being used, but was also broken.
So I contact the author again, explain the situation. They changed the code immediately, pushing out another update. In the meantime I figured out how to do dependency pinning and all was well with the world again.
And that, kids, is the story of how I came to appreciate transitive dependency pinning as a really useful feature.
(It's still amazing to me by the way that I can contact somebody that wrote some random code that our code happens to rely on and get a response within half an hour.)
I've posted this story before, but it fits here rather nicely.
I had a function that looked like this:
void f() {
bool flag = true;
while (flag) {
g();
}
}
This function would sometimes exit. But that's really all there was to the function. Somehow flag was becoming false, even though nothing ever wrote to it.
So you might think about g() smashing the stack, when a variable is mysteriously changing, but you'd expect the return address to also get written, and it wasn't - the function returned from g() to f(), found flag to be false, exited the loop, and returned from f().
Eventually I got desperate enough to look at the assembly code produced by the compiler, and I became enlightened. (This was g++ on an ARM, by the way.) flag was being stored in R11, not in memory. (Might have been R12 - it's been a while.) When g() was called, f() just pushed the return address. Then g() pushed R11, because it was going to have its own variable to stash there, and then created space for its stack variables. And one of those variables was smashing the stack by 4 bytes, over-writing the saved flag value from f().
Worse, the way the stack was getting smashed was on a call to mesgrecv(). This takes a pointer to a structure and a size, but the relationship between the two isn't what you'd expect. The size isn't the size of the structure, but rather the size of a substructure within that structure. A contractor had gotten that detail wrong when they used that mechanism for IPC between two chips. (They'd gotten it wrong on the sending side, too, so the data stayed in sync.)
The net result was that the flag got cleared when four next-door-but-unrelated bytes on another CPU were all zero. It took me a month, off and on, to figure that out.
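To make the size mix-up concrete, here is a toy simulation in Python rather than the original C++. It models a slice of stack as a byte array: a message buffer followed directly by the saved register that holds flag. The layout, the sizes and `simulated_receive` are all invented for illustration; the only thing taken from the story is that the size argument is supposed to be the size of the payload substructure, not of the whole structure.

```python
import struct

HEADER_SIZE = 4                            # 32-bit message type field (assumed)
PAYLOAD_SIZE = 16                          # hypothetical payload substructure size
STRUCT_SIZE = HEADER_SIZE + PAYLOAD_SIZE   # size of the whole message structure


def simulated_receive(stack, offset, incoming, size_arg):
    """Copy the header plus size_arg payload bytes, with no bounds check."""
    n = HEADER_SIZE + size_arg
    stack[offset:offset + n] = incoming[:n]


def run(size_arg):
    # Layout: [message buffer, STRUCT_SIZE bytes][saved register holding flag=1]
    stack = bytearray(STRUCT_SIZE) + struct.pack("<I", 1)
    # The sender made the same mistake, so four extra bytes (zero here) follow.
    incoming = struct.pack("<I", 7) + b"x" * PAYLOAD_SIZE + bytes(4)
    simulated_receive(stack, 0, incoming, size_arg)
    (flag,) = struct.unpack_from("<I", stack, STRUCT_SIZE)
    return bool(flag)


print("correct size argument, flag survives:", run(PAYLOAD_SIZE))
print("whole-struct size argument, flag survives:", run(STRUCT_SIZE))
```

Passing the whole-structure size copies four bytes past the buffer, which in this toy layout is exactly where the saved flag lives.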
Crazy thing to go with that... if you'd compiled with different (more aggressive) optimization flags, it might have gone away!
It already went away when I tried to print out the address of the variable, so that I could watch it in the debugger (because, in order to take the address of it, it had to become a stack variable).
I love these toughies, especially the full ssh one! A true debugging wizard.
“Dumb” problems can happen to anyone too. I once walked by the desk of $well_known_open_source_developer who was struggling with a mysterious bug. He’d narrowed it down to the specific function and was groveling in the function setup code (what the compiler generates before your code is called). He asked me to take a look and within seconds I saw an uninitialized variable being read.
This is not because he was a bozo! He had decades of experience. It’s simply that sometimes we get slightly wedged and can’t see the thing that is “staring us in the face”. He was embarrassed (so not mentioning his name) but he should not have been. If anything it simply proves that it can happen to anyone.
Related to this: at one organization my debugging skills were (spoiler: undeservedly) legendary...literally word got around until some new hire asked me about it months later.
Why? I came in one morning to find some folks trying to get some new model of terminals to work with the mainframe. Back then you needed the right combo of byte length, stop bits etc. They asked me if I could fix it and I said sure. As one does, I poked at the setting switches and walked off to get my coffee so I could come back and think clearly. By the time I came back all the terminals were in use so I just went on with my day.
Apparently I had randomly toggled the necessary bit. But the way the story was told: I had walked in, agreed to help, rubbed my chin then simply pushed the right button and walked off without another word. Which in some sense is true, but gave me completely undeserved credit.
When I was a kid (teenager) I worked at an indoor shooting range, mind you, not real guns but just BB guns. I was supervising a bunch of school kids doing some practice shooting at biathlon targets (just 10m however) and one of them had an issue with the gun, with the pellet getting stuck somehow. I had a look at the gun, sorted the pellet out and fired the gun off the hip and hit a bullseye without aiming. Pure luck of course but the kids were like "woaaah" and of course I never told them that it was just luck and not my mad leet shooting skills XD
on the subject of "dumb problems" and undeserved acclamation:
One day I walked into the break room and observed one of the dev team leaders pissing on about how the vending machine didn't have the snack he wanted. I tapped on the glass right in front of him and he was instantly chagrined at the appearance of what he was missing right in front of his face.
This happens often enough I have some stock humor saved up for the occasion. In a serious tone I told him not to fret, I had observed this was a common problem with good developers who don't take enough breaks and with a little self-examination he could overcome it. In a kidding tone, I told him he was obviously stuck in trap of working too hard to work smart. "Just check your assumptions with a Pareto graph and a lot of life's little dilemmas will fall apart into easy pieces," I sez with a chuckle.
Later on it turned out he had told his team and they had a chuckle over how I enjoyed kidding them. Then they turned to the problem at hand and my guy had an aha moment and realized one of the grounding assumptions they had since literally the beginning of the project was subtly off. The week before I had solved someone's Java boxing/unboxing problem literally by accident in a single glance at the debugging trace whilst kidding him over his attempt to cast Integer into Long (int promoted to Integer in a library call). The team added up all my bad jokes and occasional "accidental" help, and in a collective Aha decided I was some kind of software engineering guru (I'm not - just a lot of painful painful experience to make light of).
And for a few months thereafter the devs would intently ponder everything I said. That quarter I actually got devs assigned to my issues because they all clamored to rub up against my supposed enlightenment, heheh. After I realized what was going on I got my top ten addressed and then suggested they needed to apply their learning to other people's issues...before they discovered my feet of clay...
This is quite literally the case of it being better to be lucky than good!
Here's the craziest one that actually happened to me.
The company I worked for had installed what's best described as a mini-supercomputer (though we avoided the term) at a site in Boulder. We started getting reports of failures on the internal communication links between the compute nodes ... only at high load, late in the day. Since I was responsible for the software that managed those links, I got sent out. Two days in a row, after trying everything we could to reproduce or debug the problem, I got paged minutes after I'd left (and couldn't get back in) to tell me that it had failed again.
Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well known problem with installations in that area, having caused multi-month delays for some of the larger supercomputer installations in the area. But we'd already corrected for that. It wasn't the problem.
What it ultimately turned out to be was airflow and cooling. The air's thinner up there, so it carries less heat. But it wasn't the processors or links that were getting too hot. It was the power supply. When a power supply gets warmer it gets less efficient. Earlier in the day or with shorter runs as we tried different things this wasn't enough to cause a problem. With it being warmer later in the day, continuous load for longer periods was enough to cause slight brown-outs, and those were making our links flaky. And of course it would always restart just fine because it had cooled down a bit.
The fix ended up being one line in a fan-controller config.
I had a loaner machine (RS-6000 minicomputer) that would have unrecoverable ECC errors when the cover was on. The tech would come and try to diagnose it, but with the cover off, everything would work fine. He'd swap the memory anyway and put the cover back on. Within a few hours the memory bank would be failing again. Turned out the machine had been a loaner in a lab where it had acquired some alpha-emitting goo on the inside of the side panel. The lab had just run it with the side panel off to solve the problem, never noticing the goo, never mentioning it to IBM when they packed it up to ship.
It's a long story but the gist is after multiple board swaps, realizing we'd isolated the panel as the fault, I noticed the goo and on a hunch checked it with a scintillator, deducing it was alpha when cardboard blocked it. Turns out the ultra-precious-metal IBM heat sink on the board had an open path that effectively channeled the alpha particles into one of those multi-chip carrier thingies, which featured exposed chips.
As for why I had a scintillator lounging in my desk at a portfolio management company, don't ask. Let's just note the iconic IT anti-hero of that era was the Bastard Operator From Hell, and leave it at that.
Unrelated to a strange bug story or anything but you just reminded me of when I was also helping someone set up a, as you called it, mini-supercomputer. It was to do quantum simulations. We were setting it up and the researcher who was going to use it made the root user name skynet. Now I know that joke has probably been played out at campuses around the world but it just seems unnecessary to tempt the fates like that.
> Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well known problem with installations in that area, having caused multi-month delays for some of the larger supercomputer installations in the area. But we'd already corrected for that.
Wow, I sense a more interesting story in here. Care to reveal how it was first found out and how common it actually is?
In a nutshell, cosmic rays causing bit-flips really is a thing, and it's more of a thing at higher altitude because of less atmosphere. It's rarely a problem at sea level. At higher altitude you really need to use ECC memory, and do some sort of scrubbing (in Linux it's called Error Detection And Correction or EDAC) to correct single-bit errors before they accumulate and some word somewhere becomes uncorrectable.
The incident that brought this home to a lot of people was at either NCAR or UCAR, both near Boulder. Whichever it was, they were installing a new system - tens of thousands of nodes - and had not been careful about the EDAC settings. Therefore, EDAC wasn't running often enough, and wasn't catching those single-bit errors. Therefore^2, uncorrectable errors were bringing down nodes constantly. According to rumor, this caused a huge delay and almost torched the entire project. It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause especially when you're on the bleeding edge).
"Fail on certain moon phases" reminds me of a C++ bug I encountered while trying to set up the demo for PSIP (Digital TV Guide) destined for NAB in Las Vegas. We had programming schedules resembling excel spreadsheets and my job was just to create a good one for the demo. I would spend all night making one and sent it to my boss and each morning would get in trouble for sending in blank schedules and had no idea why. On one occasion I happened to be editing at 3am and noticed all of my edits rolling back one by one. It was actually viewable on the screen as if someone took control of excel and was rolling back each field. My immediate thought was I really need to get some sleep but later we found the auto-save feature inverted itself after 3am exactly and would go through each delta one by one rolling itself back as it had been edited. The bug was found in the calculation of the vernal equinox which moves from 3am to 9pm to 3pm. Since it was triggering the leap year code 6 hours of time would get rolled back edits and all! This was of course 2008 year of the digital transition from analog cable which happened to also be a leap year.
I can't scan documents when my daughter is asleep. When she is awake, all is fine, but the minute she goes to sleep, and I'd like to use my free time to scan documents and suchlikes, forget it. I could still print documents on the same device though. Here's what I found:
The printer-scanner was connected to wi-fi. The wi-fi router was in my daughter's room, as that is where the cable socket was, tucked just behind a bench in her room. It was also near that bench that her baby monitor camera was standing. It wasn't wi-fi connected, but for whatever reason it interfered with the wi-fi signal. Same with the receiver, if I put it near my laptop, the wi-fi connection would die.
The monitor was off most of the time, and on precisely when my daughter was asleep.
As for why I could still print, just not scan: presumably that's something to do with the bandwidth, I'm guessing it took more wi-fi bandwidth to send a scanned image than to print a document (I never printed pictures on this printer).
Yeah, baby monitors are THE worst behaving radio devices. They would be second after malfunctioning neon sign transformers, but transformers are not intended as radio devices.
As for why the scanner and not the printer lost its connection: probably the scanner has a small send buffer and, instead of pausing when that buffer fills up during retransmissions, it stops completely. The printer can probably just wait for more data.
And yet there are people out there who will still try and argue that Signs is not a brilliant film
Still... my thinking is, I'd rather a crappy baby monitor than a badly-secured device connected to wifi, and by extension, the whole world.
We do actually have a spare camera we sometimes take away, that works on Wi-Fi. I accidentally left it once at my parents' place, then travelled back home overseas. I got a notification of my baby crying desolately as I got off the plane. Turns out it was my family arguing about something in the room where my daughter had been staying, and the camera somehow switched itself back on. It was freaky!
The main reason 2.4GHz band is unlicensed (anyone can broadcast on it without e.g. a ham license) is because it's garbage. Baby monitors and microwave ovens are two of the most common offenders for dumping large amounts of noise into that band. If you can move the printer to a 5GHz band, that could help a lot.
And in the near future, wifi 6E will add a ton of spectrum, allowing devices to just avoid noisy channels. But to use that you'll have to upgrade the printer or add a wifi 6E bridge.
Had a Customer who complained that the Wi-Fi in one of their conference rooms was "always" unreliable. I checked it multiple times and didn't find any problems-- high SNR, strong signal, low airtime utilization.
Eventually I found out that "always" meant a particular recurring lunchtime meeting, scheduled right when a steady stream of workers were going into the break room across the hall and heating food in a microwave oven.
I had a Panasonic 2.4GHz jamming device. Could take out channels 1-11 all at the same time. It wasn't supposed to be a jamming device, it was supposed to be a cordless phone. Set up wifi for a customer, and it worked fine when I was on site. Later the customer complained it was completely unusable, in what they thought was a random fashion.
Turned out whenever they used the cordless it made so much noise on the 2.4GHz band that wifi wouldn't work. Phone itself worked fine. Had them get new phones and the problem went away.
You really want to think about moving the Wi-fi out of the baby's room. Get a GQ-390 meter and you will see the torrent of dangerous radiation flooding her room, above recommended levels.
I was able to get my Internet provider to relocate the device to my basement.
I just exposed myself to radiation from the largest fusion reactor in the solar system! Should I panic?
....errr....
Wait.
I have just been slipped a note that the first sentence I wrote means "I just went for a walk in sunshine."
Better stay inside, the outdoors are bathed in a torrent of EM radiation from around sunrise to sunset, every day.
In fact, the other people in your home are bathing you in infrared radiation when you're near them.
I bet you believe in 5G conspiracies too
Really? I don't know much about that... is it that the norms are unsafely high for these devices, or is it that most kit doesn't quite meet them?
Could you point me to any articles etc. on this?
Thanks in advance!
My favourite crazy bug was during a university course on autonomous robotics. One of the other groups was using a metal castor at the back of the robot along with 2 driven wheels. After a little while their robot would completely crash and stop responding.
I'd previously encountered a similar issue which was due to the library code we'd been given which opened a new /dev/i2c file for each motor command, eventually exceeding the max file handles and killing the program. So I assumed it was something sensible like that.
Some time later they got all excited and called us over to explain the real reason it was crashing. Their robot would initially work fine for a reasonable period of time. Then when the robot drove over the metallic tape on the floor of the arena it would die. The robot must have been building up a static charge while moving around which would eventually be dissipated when the metal touched the tape.
I wouldn't have believed it had they not set up two tests, one outside the normal arena and one inside. Changing the metal castor for a bit of Lego fixed the problem.
Display intermittently blanking, flickering or losing video signal:
https://support.displaylink.com/knowledgebase/articles/73861...
"Surprisingly, we have also seen this issue connected to gas lift office chairs. When people stand or sit on gas lift chairs, they can generate an EMI spike which is picked up on the video cables, causing a loss of sync. If you have users complaining about displays randomly flickering it could actually be connected to people sitting on gas lift chairs. Again swapping video cables, especially for ones with magnetic ferrite ring on the cable, can eliminate this problem. There is even a white paper about this issue."
Risks Digest was on my daily reading list for years. I've been in computing since the early 70s, the history of computing is fabulous background to inform e.g. debugging, or SRE.
I've seen this one on Twitter: https://twitter.com/royvanrijn/status/1214162400666103808?la... There's probably some civilizational complexity limit where the unexpected interactions between seemingly isolated pieces of tech become so bad that we cannot introduce anything new without introducing a legion of weird bugs.
>There's probably some civilizational complexity limit where the unexpected interactions between seemingly isolated pieces of tech become so bad that we cannot introduce anything new without introducing a legion of weird bugs.
Come to think of it, I believe that meme was wandering about on Usenet from before the early days. I think Vernor Vinge alluded to it in one of his novels (paraphrasing here): some interplanetary civilization crashing because in-transit space traffic was so dense that no new launches could occur in a useful time frame due to safety lockouts, and they didn't want to accept the risk of changing the safety margins...
I got a tiny one of these.
One time I was writing some code in C. I found a bug, the solution seemed pretty obvious, so I fixed it, recompiled the code, and ran it again. The bug was still there.
I took a look at the rest of the code in case that I missed something. I couldn't find anything, so I added a few print statements and recompiled. I ran the code and nothing came up.
Interesting, apparently the code is not executing the branches it should. I verified the input data and code. It didn't make sense, there had to be some serious bug there that I didn't consider. I added a bunch more prints.
Recompile and execute. Still nothing. Wait a minute, THAT doesn't look good. I added a print statement right at the entry point of the program. Nothing.
At this point the root problem became apparent; my changes just weren't getting compiled. Phew, problem solved! I cleaned all the cached files and recompiled the source code. Those print statements still weren't coming up.
In the end I had to move my source code to another machine and compile it there to get it working. I suspect some global variables or path trickery were involved, but to this day I still haven't got a clue what was wrong, nor have I seen it happen again.
Ahhahahaha. I have a similar story to that, but I eventually realized what happened.
I forget which command it was exactly, but I rsync'd or something to get a new code directory, and with the backup options in use the directory I was in got renamed.
But I still had command prompts open in that directory. And all of the files were there. So I didn't realize that one directory was not equal to the others even though it had the same name (it was a subdirectory) and appeared to have the same files.
One similar story: I was maintaining some C++ code that had a few #ifdefs in it. Someone reported a problem.
I put a breakpoint on the calling code and traced into my code. It went into the #ifdef code I expected, but the problem persisted.
Just to double-check, I let the program run until it hit the breakpoint again and traced in, but this time, it went into the #else code! That code should have been removed by the preprocessor, yet here I was, currently stepping through it.
After questioning my understanding of the C preprocessor (and my own sanity), I luckily noticed that the module name in the debugger was very slightly different in the two cases mentioned above.
The world finally made sense again. My code was in a header file that was compiled (with different symbols defined) into two different modules and both of those modules were loaded into the same process. When I set the breakpoint in the debugger, it silently set breakpoints in both modules.
.o file timestamp newer than the .c timestamp. Make figures the file can be skipped. Run a recursive touch on the problematic files and headers.
The most fun is when the build then restarts and cycles through builds until the clock passes some magical timestamp, making the build succeed.
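For anyone who hasn't been bitten by this, here is a minimal sketch of the timestamp comparison that make-style tools rely on. It assumes one source file and one object file; the `needs_rebuild` helper and the file names are made up for illustration, and real make has many more rules than this.

```python
import os
import tempfile
import time


def needs_rebuild(source: str, target: str) -> bool:
    """Mimic make's basic rule: rebuild if the target is missing or older than the source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)


with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "main.c")
    obj = os.path.join(d, "main.o")
    for path in (src, obj):
        with open(path, "w") as f:
            f.write("")

    now = time.time()
    # Object file stamped one hour in the future (e.g. built with a wrong clock).
    os.utime(obj, (now + 3600, now + 3600))
    # The source is edited (or touched) "now", yet it still looks older than the object.
    os.utime(src, (now, now))
    edit_ignored = not needs_rebuild(src, obj)

    # Only a touch made after the clock has passed the bogus stamp triggers a rebuild.
    os.utime(src, (now + 7200, now + 7200))
    rebuilt_later = needs_rebuild(src, obj)

print(edit_ignored, rebuilt_later)
```

The second half is the "magical timestamp" effect: until the wall clock overtakes the object file's stamp, touching the source changes nothing.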
Obvious in retrospect, but very surprising to my inexperienced past self:
I'd been working on some C code for an hour or two. It wasn't behaving how I expected it to (and at the time I knew nothing of debuggers), so I added a print statement and recompiled. I got a compilation error: something like "Syntax error on line 123: #incl5de <stdio.h>". Shocked, I scrolled to that line in my text editor to fix the typo, but it wasn't there. I compiled the same code again and there were no errors.
Turns out there wasn't a bug in my code! I immediately shut down my computer because my RAM was going bad. To this day, what surprises me most is that my computer was able to successfully boot and behave normally for an hour or two, even though random bits were apparently being flipped.
Reminds me of a rogue NIC, flipping a single bit every now and then, that took down S3: https://news.ycombinator.com/item?id=13859733
I recall a talk about someone registering domains similar to google.com and, due to occasional bit flips, people landing on these domains.
The RAM going bad in my PC was one of the most annoying issues I had to troubleshoot. It started with my Firefox pages randomly crashing, first occasionally and then several times a day.
I then started getting errors when playing games with obscure error codes - which yielded nothing when searching them up.
I eventually found a comment in a thread about the crashing game saying that the RAM could be bad. I ran some diagnostic tests, and with the number of errors that came up I was surprised my computer worked at all.
> To this day, what surprises me most is that my computer was able to successfully boot and behave normally for an hour or two, even though random bits were apparently being flipped.
With my current computer I overclocked the RAM to the best config I could get memtest to run without errors over the night. The RAM also has ECC and there were no problems reported during normal operation, even when re-compiling (most) packages. But when I got to compiling LLVM the system would crash shortly after logging ECC errors. Backing off the overclock a bit fixed that. So the rate of memory errors can definitely depend on what you are running.
> To this day, what surprises me most is that my computer was able to successfully boot and behave normally for an hour or two, even though random bits were apparently being flipped.
Yup. I once had a machine that would freeze up when I ran package updates. I thought (of course), that the package manager had a bug. Turned out, running the upgrade was the only thing memory-intensive enough to use the faulty memory that I'd installed. After all, on a light system you can totally boot and run in <1GB of RAM...
I remember reading one years ago where someone had a problem installing new software on some embedded device - whatever they did it came up "checksum is bad".
After much testing they eventually realised that the checksum literally was the hex "bad".
I have two favourite bugs, one weird and one dumb.
Weirdest one was an IDE where the colorizer gave up on source lines longer than 998 chars. Instead it rendered the whole line as background, i.e. invisible. I once wasted two hours debugging a program with an invisible line of code!
The dumbest was a postage billing system for a bank using a third-party Print-and-Mail company. Somehow the billing system went live adding the previous day's total postage costs to itself, then adding the new day's postage. These exponentially growing totals were then paid automatically by the accounting system each night... So it goes live, ... and a week later Finance gets an alert that the account is overdrawn... They actually paid out nearly $1b in postage costs before hitting their internal credit limit with the bank's treasury.
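The runaway arithmetic in the postage story is easy to reproduce. A back-of-the-envelope sketch, assuming a flat daily postage figure (the 100,000 is made up, not from the story) and reading the bug as "new total = old total + old total + today's postage":

```python
def buggy_daily_payments(daily_postage: int, days: int) -> list:
    """Amount paid each night when the running total is added to itself before today's postage."""
    payments = []
    total = 0
    for _ in range(days):
        total = total + total + daily_postage  # bug: previous total counted twice
        payments.append(total)
    return payments


payments = buggy_daily_payments(100_000, 7)  # 100,000 per day is a hypothetical figure
for day, amount in enumerate(payments, start=1):
    print(f"day {day}: {amount:,}")
print(f"paid in total: {sum(payments):,}")
```

Whatever the real daily figure was, under this reading the nightly payment roughly doubles every day, which is why a single week is enough to get alarming.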
Which IDE was it?
That is what you want to know? :D
I love this article.
My most recent bizarre bug: a coworker came to me with a bug where no matter what he tried, he could not get an if some_var is null to be true. The debugger would show the value was null. The logger showed the value was null, but the if statement would not work. After a morning of trying to fix it, he asked if I would take a look. I told him to put the null in quotes in the if. It worked. Turned out a JavaScript library had a bug where it would use the string "null" instead of null.
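A minimal sketch of the same trap, transposed to Python and JSON as an assumption (the original was a JavaScript library, and the payloads here are invented): a real null and the string "null" look alike in a log line but are different values.

```python
import json

# Two hypothetical payloads: one carries a real null, the other the string "null".
real_null = json.loads('{"some_var": null}')["some_var"]
string_null = json.loads('{"some_var": "null"}')["some_var"]

print(real_null is None)       # the check that was expected to pass
print(string_null is None)     # the value the library actually produced
print(string_null == "null")   # the "put the null in quotes" workaround
print(type(real_null).__name__, type(string_null).__name__)
```

Printing the type alongside the value is the cheap way to catch this class of bug, since the value alone reads the same in most log output.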
There's a popular progressive API I've been forced to work with that uses "true" and "false" instead of their respective booleans. The most egregious that I've ever seen in this class of errors was an API (from Google!) that returned " false".
Also watch for code that runs off the edge of the screen, to the right, while editing in an app with horizontal scroll bars.
My favorite was when I was working at SGI after it had taken over Cray Research. I was one of the lowly Cray guys in Wisconsin working with the wunderkinder in California. I was to run the regression tests on the chip being designed in California, using some software that they had provided. I would run the tests, but some days they would crash in the middle of the night. Then the California guys would be angry that they got no test results. I started debugging the code and got to a program called lswalk that would dole out jobs to the dozens of servers to be 'run'. The code was written by a hotshot young MIT graduate, but I was sure that the problem was with this code. I got the source code and started looking for problems, and one thing I found was that if one of the servers responded with an error, the code would print out an error message. One problem though... The error message was printed from an uninitialized string, so the printf routine would search for an end of string that was never there, probably overwriting buffers and crashing code all over the place. So one lesson is that even the best and brightest make mistakes. Sometimes I wonder how we accomplished anything in those days with software that had so many trap doors beneath it.
The very first time my company farmed me out to work onsite with a client. Day 1, Job 1: download the client's website code and get it running on my laptop.
... It just wouldn't work. Everything I tried - failed. Everything else on my laptop was working fine, except this code. Everyone else who had ever downloaded the code had managed to get it working on their machines within a couple of hours. Colleagues working onsite with me tried to help, but everything they tried - failed. Finally the decision was taken to reset my laptop to factory defaults and reinstall everything. That took up half of Day 2. Tried to get the client's site running - failed! Things were beginning to get really embarrassing - all this was happening in full view of the clients. In desperation, my company called me back to their offices and issued me with a new laptop. Back onsite, the code downloaded ... and worked first time!
Turned out that the issue was that my hard drive filesystem had been set up (not by me!) as case-sensitive, and the client code included a file with an all-caps filename, which the code called using a lowercase string. Almost lost my job over that one.
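A check along these lines would have flagged it on day one. This is only a sketch: `find_case_mismatch` and the file names are hypothetical, and it compares names as strings so it gives the same answer whether the underlying filesystem is case-sensitive or not.

```python
import os
import tempfile


def find_case_mismatch(directory: str, requested: str):
    """Return the on-disk name that matches `requested` only when case is ignored, else None."""
    names = os.listdir(directory)
    if requested in names:
        return None  # exact match: works on any filesystem
    for name in names:
        if name.lower() == requested.lower():
            return name  # would load on a case-insensitive filesystem only
    return None


with tempfile.TemporaryDirectory() as d:
    # Hypothetical stand-in for the client's all-caps file.
    open(os.path.join(d, "CONFIG.INC"), "w").close()
    mismatch = find_case_mismatch(d, "config.inc")
    exact = find_case_mismatch(d, "CONFIG.INC")

print(mismatch, exact)
```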
My weirdest that I can recall right now was a PDF file that would not print. Since the printer was typically unhelpful with the error message, as was support (this is a room-sized commercial printer but we didn't get the help I'd really expect), I had to dive into it myself.
Long story short, whatever had produced the PDF had also embedded a TrueType font where one character was named //something. This is fine. The character just has a weird name, but it works. It's technically up to spec AFAIK, and I got it out of the PDF with ttfdump to have a look at it.
Well the printer's internal RIP, unknown to us, converted the PDF to Postscript when rasterising. And //something is called an "immediately evaluated name" which I forget the details for, but basically this font character, interpreted as postscript, was causing a lookup for a named variable which did not exist. Hence the crash.
I had a similar one where Adobe InDesign had been used to make a PDF where someone had selected the words to change the font, but not the spaces between (or perhaps they did, and it was a bug). This meant that the PDF included a subset font that only included the space character. Since the space character is not drawn, this resulted in a 0 byte long glyf table. Based on my reading of the TrueType font spec at the time, this isn't really proper.
Printer didn't like that one bit and died as it does to anything that smells slightly wrong. Adobe said it was fine and up to spec though, apparently 'TrueType' has a different meaning inside a PDF :)
My favorite out of these is the 500 miles email limitation one. I work mostly on big bulky manufacturing equipment but my job is to abstract out the computing part. This story reminds me that every time I want to do something I am still limited by physics. I am reminded of this story whenever the hardware people ask me to insert an artificial delay in computation.
Yes, the old limits-of-the-time style bugs are great. Some arbitrary limit, or a variable sized to hold a value deemed way more than enough at the time, jumps out years or decades later and catches you out. Y2K was one of the well-known ones, but there have been many bugs of that type.
I love programmer stories like this. My favourite personal experience was on my first Ruby on Rails project after first moving to London. I was pretty green at the time, having had only a few years of PHP experience under my belt and little else.
We had to build a Rails app around a poker game. We didn't own the source to the poker game or its API, but we had to embed it nonetheless. We had this really strange issue where some people, under a certain circumstance, couldn't get into the game. It would just boot them out. Me and my team mate must have pored over the Ruby code dozens and dozens of times and found no evidence of this bug and no way to reproduce it; bear in mind I was still learning the ropes, and jumping head first into an unfamiliar codebase is quite daunting.
Eventually I decide to get my hands dirty and I start poking into this game engine. We embedded it as a Flash widget, but the server doing most of the work was written in a mix of C++ and Python. I didn't fully comprehend what I was looking at but, even though things looked suspicious, I couldn't put my finger on an actual problem until I looked at the API written in Flask and noticed that one line of code didn't look like any other.
If the request didn't contain the parameter `some_key` then this would raise a KeyError.
After maybe three solid weeks of trying to debug this thing, I submitted a one line patch:
It's not quite as weird or as fun as most examples but for me personally it was such a great lesson in debugging and being curious about unfamiliar stuff, rather than closed off or afraid.
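The actual lines aren't shown here, so purely as an illustration of the kind of bug described: a sketch assuming the handler indexed a dict-like parameter mapping directly. Everything except the name `some_key` is hypothetical, and a plain dict stands in for the Flask request parameters.

```python
def handler_strict(params: dict) -> str:
    # Direct indexing raises KeyError when the parameter is absent.
    return params["some_key"]


def handler_tolerant(params: dict, default: str = "") -> str:
    # dict.get returns the default instead of raising.
    return params.get("some_key", default)


request_without_key = {"player": "alice"}  # hypothetical request parameters

try:
    handler_strict(request_without_key)
    strict_failed = False
except KeyError:
    strict_failed = True

print(strict_failed, repr(handler_tolerant(request_without_key)))
```

A one-line change from indexing to `.get` is the usual shape of such a patch, though whether that matches the real fix is a guess.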
I had this one.
We were using mSQL in the 90s for web projects. A very important customer wanted a "real" database so we bought DB2. Because we didn't have an IBM platform or Solaris we went with Windows NT.
Everything went fine, until one day we noticed the website being slow. Investigating pointed to the database as the culprit. So I went and logged into the NT box in the data center and checked DB2. Everything was fast. Back to my desk, and the database was slow again after some time. Back to the NT server, and the same thing happened.
After quite a long time I found the real culprit: NT's OpenGL pipes screen blanker, rendered in software. After some time without interaction the screen blanker started up and took all the CPU. So the database and the website went slow. Someone had set the screen blanker to the nice GL pipes renderer.
[Searching the web, IBM introduced DB2 for Windows NT 31.10.1995 and I went to Cebit that year to check it out]
This reminds me of a nice day we spent at a customer's premises trying to figure out why DB2 wouldn't install or start properly on a win2k box. Weird error messages etc. The problem was that it didn't like that the box was named 'DB2'...
See also:
“COMPUTER-RELATED HORROR STORIES, FOLKLORE, AND ANECDOTES”
https://www.cs.earlham.edu/~skylar/humor/Unix/computer.folkl...
“Computer Stupidities”
http://www.rinkworks.com/stupid/
We're actually working on a collection of such stories internal to our division. We've found that these tales are a great way of helping people understand the complexities and quirks of our nearly 3 decade old code base.
I think story telling is an underrated technique in our profession.
In all projects there are coding rules (like "make destructors noexcept"). A rule sticks much better if you also tell a story about some debugging caused by not following the rule.
I worked at a place once that had a style guide for the main front-end language, with links to terrible things that had happened as a result of breaking the style guide's rules.
It was surprisingly effective, so I completely agree with you on the story-telling.
First month or so at my new employer, a big consultancy firm for a financial institution. We had a fairly complex distributed monolithic application integrated with Tibco EMS, Oracle DB and distributed XA transactions.
Regularly, but randomly, in production, after receiving a good number of messages in the input queues (which then got rerouted to other event queues for parallel processing), some DB transactions simply got stuck. Not rolled back, but stuck in limbo -- after a while the DB simply refused new transactions because so many were stuck. Nobody had a clue why that was happening; it meant regular manual restarts of the services and re-feeding of the failing messages. Users started to get fed up and the project threatened to fail.
I got into it and, after a couple of weeks of investigation and trial and error with all possible weird flags, it turned out that our version of Tibco EMS had a weird behavior with distributed transactions when the queues got full of messages (queues had a 50MB size limit).
Instead of gracefully rolling back the JMS+JDBC XA transaction, it...kinda exited with an IO error.
Turned out that newer versions of Tibco EMS fixed that issue, but there was no way to get ops to install a new version. Since upgrading was out of the question, the actual fix was to enable message compression to limit the size of the messages coming into the queues; it turned out that the XML messages we sent there were up to 1.5MB (!)
After discovering that, became basically a war hero and respected by the client as the "savior of the project". Good times.
Your compression workaround reminded me of an issue I ran into a while back.
My team at work uses a reporting tool for vulnerability assessments and pen-tests; basically you can import a bunch of data files, review it in the web app, and generate a report.
I would run into cases where I couldn't upload one of my data files. The web app is JS-heavy, lots of things going on in the background without much visible feedback. It turns out that the programmers had implemented the upload as this async task with a hard-coded timeout for completion, and they likely wrote it while they had great network speed.
I'm on DSL, and generally, it gets the job done. However, upload speed is only 1Mbit/s, so with a big file, my upload would time out. It's hard-coded remember, so it didn't matter that it was still functioning when it got clobbered.
It occurred to me that some file formats, like WAR or Office documents, are basically Zip archives under the hood, so I put my large XML file into one, and tried that.... and it worked! Something on the back-end quietly unzipped my upload and imported the file it contained.
Funnier is that when I mentioned it to the devs, this behaviour was not something they expected. Probably built into a library they use.
This one happened last night. A student contacted me because her Anaconda Jupyter notebook (installed on her laptop) just wouldn't connect to the Python kernel. (The notebook itself would load, though, meaning that the server was running fine. It's just that the kernel and its websocket was failing.) I should point out that, because of COVID-19, this troubleshooting was over Zoom, which complicated the diagnosis a bit.
She had not been using Jupyter for several months, as we have been writing stand-alone programs in class using Spyder (the editor that comes with Anaconda), and the command line, and Jupyter had worked the last time that she tried it.
We restarted everything, and still the problem was there. I helped her to update everything, but that didn't solve the problem.
Finally, I looked at the error messages in the console where the Jupyter server is running. It had a huge list of errors, all relating to the pickle library.
We had done an exercise with pickle in the class, but nobody had reported a similar problem. When we looked in her classwork directory, though, we saw that she had created a "pickle.py" file when she was testing something with pickle. But, at that point in the class, we were working in the command line, and everything (including Spyder) still worked just fine.
Evidently, this was the cause of Jupyter's problem. When trying to start the Python kernel in Jupyter, it imported pickle, and evidently it imported her test file rather than the actual library. The fix was simple: we renamed her test file, and everything worked perfectly.
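The shadowing is easy to demonstrate without breaking a real install. A minimal sketch, assuming the working directory sits ahead of the standard library on the import path (the temp directory and scratch file below are stand-ins for her classwork folder):

```python
import importlib.machinery
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # A stray test file with the same name as a standard-library module.
    stray = os.path.join(workdir, "pickle.py")
    with open(stray, "w") as f:
        f.write("# scratch file from a class exercise\n")

    # Resolve "pickle" the way an import would if workdir came first on sys.path.
    spec = importlib.machinery.PathFinder.find_spec("pickle", [workdir] + sys.path)
    shadowed = spec is not None and os.path.samefile(spec.origin, stray)

    # Without workdir on the path, the real library is found instead.
    real_spec = importlib.machinery.PathFinder.find_spec("pickle", sys.path)
    real_origin = real_spec.origin

print(shadowed)
print(real_origin)
```

In practice, printing `pickle.__file__` from the failing environment answers the same question in one line.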
There's a GitHub repository for such stories [1]. I've even contributed one of my stories: "Script crashes before 10 a.m." [2]
[1]: https://github.com/danluu/debugging-stories
[2]: https://darekkay.com/blog/script-crashes-before-10/
This scratches a "thedailywtf" itch I had forgotten I had :)
One favorite in the category is always the good old "Can't print on tuesdays" ubuntu bug that has been submitted here a bunch of times: https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...
I had a function which only failed at 8 or 9 minutes past the hour.
It parsed a string containing the timestamp and "08" or "09" was interpreted as an invalid octal number. Argh.
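The octal trap can be sketched in a few lines. The comment doesn't say which language or parser was involved, so `parse_c_style` below is a hypothetical stand-in for any parser that treats a leading zero as "this is octal":

```python
def parse_c_style(text: str) -> int:
    """Parse an integer the way C-style literals do: a leading 0 means octal."""
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)  # raises ValueError for the digits 8 and 9
    return int(text, 10)


def minute_parses(text: str) -> bool:
    try:
        parse_c_style(text)
        return True
    except ValueError:
        return False


# Try every two-digit minute field and keep the ones that fail to parse.
bad_minutes = [m for m in (f"{i:02d}" for i in range(60)) if not minute_parses(m)]
print(bad_minutes)
```

"01" through "07" mean the same thing in octal and decimal, and "10" onwards has no leading zero, which is why only two minutes per hour misbehave.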
The beauty of loose dynamic typing
Had a flaky unit test that would randomly fail with some random Chinese character in the output.
The test was running a log parsing tool against a temporary file that had a pseudo-SQL syntax where you could “select ... from c:\Users\...\temp\abcd1234.xyz\testdata.dat”. The temporary directory was a randomly generated name so that the folder was guaranteed to be empty before every execution of the test.
The test failed on the rare occasion that the randomly generated temp dir consisted of the letter ‘u’ plus four characters that were valid hex digits. When this happened the randomly generated dir name interacted with the backslash before it and became a Unicode escape sequence. It was easy to fix but that test was flaky for months before anyone worked out why.
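A rough sketch of the mechanism, assuming the query parser expanded backslash-u escapes inside the path (the `expand_unicode_escapes` function and both directory names are hypothetical; the real tool's parser surely did more than this):

```python
import re

_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def expand_unicode_escapes(text: str) -> str:
    """Replace every backslash-u-XXXX sequence with the character it names."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Only the second directory name starts with 'u' followed by four hex digits.
harmless = r"c:\temp\abcd1234.xyz\testdata.dat"
unlucky = r"c:\temp\u4e2d1234.xyz\testdata.dat"

print(expand_unicode_escapes(harmless) == harmless)
print(expand_unicode_escapes(unlucky))
```

The unlucky path loses its separator and gains a CJK character, so the file lookup fails and the stray character shows up in the output.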
A bank I did some contracting for had a problem where their Token-Ring network would crash at random intervals during the day in one of their branches. It would also crash at night, but the times when it would happen were more predictable.
And that was the clue they needed to solve the problem - it turned out that the wiring installers had run the cable up the elevator shaft. When the elevator stopped at a certain floor the door motor was sometimes interfering with the signal. The more-regular nightly disruptions were because of the security guard making his rounds.
It turned out that run was pretty close to the length limit for 16 Mbps Token-Ring, so they added a repeater in the middle to boost the signal strength.
Was looking for the 500-mile email - it's there.
I eventually gave up without finding the issue, but somewhere deep inside one version of the Sphinx full text search software was a bug that would sometimes switch which query got which result set. It only happened sometimes when queries were within a few seconds of each other, but it wouldn't happen with only one front end process, even in multithreaded mode, and it would disappear if requests were _too_ close together. If I'd found a way to reproduce it I'd have submitted it to the Sphinx team, but after a few days of potentially private info leaking I gave up and moved to PostgreSQL's FTS.
Similar to the train being stopped by a toilet flush, in the 90's, I worked with devices based on Microsoft's Pocket PC OS. These were equipped with wireless radios. The transmission of a packet caused some interference that the device interpreted as a click on the screen. The cursor was over the [X] to close the application window, so the application would just quit, looking like it crashed.
IBM Java 1.1.8 that was embedded into Lotus Notes 5 (if I remember correctly) didn't have 29th of April if it happened to be on Tuesday.
When you constructed a Date object of 29th of April with such a year that it was a Tuesday, you get the 30th of April when you read back the value. Took a while to figure out why date calculations were sometimes off. The flux of expletives was impressive when we finally did...
Any ideas on why it happened? I'm impressed you found that bug though.
No idea, it was years ago. JVMs were extremely buggy. We were also quite n00bs at the time, so we didn't dig into the depths of it; we had to fix a production issue that we had already spent a lot of time on. We found the bug by just narrowing down a broken case further and further. Finally we wrote some code that just tried a range of dates one by one over several years to find the pattern. Unfortunately it wasn't possible to upgrade the JVM as it was embedded into Lotus Notes, so we wrote our own date implementation (yay!) that satisfied our needs. It was fixed in a later Notes version, but our sh*tty date implementation lived much much longer...
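The "try every date and see which ones read back wrong" approach can be sketched like this. The original was Java; this is a Python transposition, and `buggy_make_date` is a made-up stand-in that imitates the described symptom, not the real JVM bug:

```python
import datetime


def buggy_make_date(year: int, month: int, day: int) -> datetime.date:
    """Stand-in for the faulty library: loses 29 April whenever it falls on a Tuesday."""
    d = datetime.date(year, month, day)
    if month == 4 and day == 29 and d.weekday() == 1:  # Monday is 0, so 1 is Tuesday
        return d + datetime.timedelta(days=1)
    return d


def find_roundtrip_failures(make_date, start: datetime.date, end: datetime.date) -> list:
    """Construct every date in [start, end] and collect those that read back differently."""
    failures = []
    current = start
    while current <= end:
        result = make_date(current.year, current.month, current.day)
        if (result.year, result.month, result.day) != (current.year, current.month, current.day):
            failures.append(current)
        current += datetime.timedelta(days=1)
    return failures


failures = find_roundtrip_failures(
    buggy_make_date, datetime.date(1990, 1, 1), datetime.date(2010, 12, 31)
)
for bad in failures:
    print(bad.isoformat(), bad.strftime("%A"))
```

Printing the weekday next to each failing date is what makes a pattern like "only Tuesdays" jump out of an otherwise random-looking list.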
Not software, but along the lines of cool engineering stories one of my favorites is this one about fixing 230 kV, many-hundred-amp, 10 mile long coax cable in Southern California.
https://www.jwz.org/blog/2002/11/engineering-pornography/
Glad to see the More Magic story in the list.
I just had a weird bug in a programming competition.
You basically had to sort the English letters that occur in a text according to their frequency descending. Except the one letter that occurs the least, needs to be sorted as if it occurs the most.
The expected output of the sample case was TPFOXLUSHB
I ran my program on the sample, the output looked correct; then I submitted it, and the judge said it failed the sample case. In fact, it was printing ͲPFOXLUSHB
That nearly looks like the correct output.
I had confused two variables and it was printing the frequency count as a code point rather than the letter. But such a coincidence that it looks the same.
I used to work as a senior technical support for BEA Weblogic and had all sorts of crazy situations to debug remotely. Including one time when I had to get a person to edit a config file in Vim (which they never used before), on Unix (which they never used before), with me guiding them by phone (no visuals).
This is the one I recorded that seems to fit into the current theme: https://www.outerthoughts.com/2004/10/perfect-multicast-stor... (tldr: multicasting on 237.0.0.1 is bad).
And if somebody really understands network and multicast, I would love to know whether I actually nailed the problem or just made it go away accidentally. I have no problems with being wrong, especially this much later :-)
Thanks, this is hilarious! "Okay! I'm braking now", definitely my new going to the toilet catch phrase.
I'm very disappointed that the very first entry in the list is a bogus story confirmed to be false.
I mean, if you simply want general computing legends of unconfirmed veracity, read “The Devouring Fungus” by Karla Jennings.
The list desperately needs another story where the title starts with A or B.
The SSH one is brilliant
I find that most of these "hard problems" come down to something small, something almost unnoticeable that doesn't break things immediately but lets them kinda work.
Now, finding exactly what it is, that's the trick.
For me it was a 21-day bug in a broadcast video encoder: some internal frame counter was declared as int instead of int64, so it would reset every 21 days. Fun to debug it was not!
It is good to pore over these legendary tales. They come in handy when we need to break out of the moment and try something outrageous to solve the problem.
Crash cows: “There were often significant food shortages in the Soviet Union, and the government plan was to mix the meat from Chernobyl-area cattle with the uncontaminated meat from the rest of the country. This would lower the average radiation levels of the meat without wasting valuable resources.”
There is some sense to that: low levels of radiation are not a cancer risk last I read - everything we eat is slightly radioactive. That said, I can’t think how significantly radioactive cows could be “diluted“ enough.
It's worth knowing that all the cows you eat are radioactive, just like bananas. If the cow didn't die, you've got a good chance too.
ground beef. but it does seem dubious at best. and also like totally the Soviet Way.
The Crash Bandicoot story is my favourite.
This was my first thought too - "Wonder if that story with the PS1 controller is in there" - man that must have sucked to debug!
It did.
Always impressed by Mel's story.
That describes a real machine, the LGP-30, which really had absolutely no business being as powerful for the price and era as it was. It used an oscilloscope for its (tiny) debug display: the operator’d read the voltage waveform directly as binary.
https://en.wikipedia.org/wiki/LGP-30
So at my previous employer we still had an old frontend running an atrociously old version of EmberJS. I run the build but it suddenly fails. We were on a three-weekly release schedule, so I figure it must have happened somewhere in the past three weeks, about 400 commits. So I start git bisecting, teaching it to a handful of my more junior colleagues as I go along. It took us way too long to figure out that the original build also failed, however.
So my teammates wish me good luck, and I go off on a journey debugging what the issue actually is. As it turns out, the horribly old Ember.JS CLI package version we're using is a version called '0.2.0-beta'. That did not bode well. This frontend of course did not use the nice yarn dependency pinning, just a regular old package.json file, so I go tracing the error into the dependencies.
Eventually I trace the thing to a dependency nested three layers deep or so. A library added a deprecation warning when being used. That in itself is not so bad, but it did that using an injected logging framework from the package using that library. Except that wasn't introduced until a way later version. Of course this tiny little addition could never cause any breakage, so this was released as a semver bugfix release.
The commit time shows it was 23:00 local time (https://github.com/goldenice/ember-cli-babel/commit/c4c95d6f...) when I figured the problem out and committed a fix. So I submit a PR to the library, figuring that if the author happens to be awake I won't have to figure out a way to pin the dependency to an earlier version (which would have been easy in a regular dependency, but this was a global dependency, where it's not as trivial as switching from npm 2.x or 3.x to yarn)
The author thankfully responds almost immediately, asking me to provide a fallback to console.warn instead of skipping the deprecation entirely. Makes sense, so I update my PR, submit it within a few minutes, and I see that the author immediately publishes a new version. Finally something works out for me. Or so I thought.
As it turns out the author made a tiny stylistic fix in my code. Except that tiny stylistic fix butchers my carefully crafted if statement, and now the code is broken again. It took me a while to figure out that the new version of the dependency WAS being used, but was also broken.
So I contact the author again, explain the situation. They changed the code immediately, pushing out another update. In the meantime I figured out how to do dependency pinning and all was well with the world again.
And that, kids, is the story of how I came to appreciate transitive dependency pinning as a really useful feature.
(It's still amazing to me by the way that I can contact somebody that wrote some random code that our code happens to rely on and get a response within half an hour.)