Comment by jorl17
14 days ago
This is only very tangentially related, but I got flashbacks to a time when we had dozens of edge/IoT Raspberry Pi devices with completely unupgradeable kernels and a bug that would shut down the whole USB stack after roughly a week (7-9 days) of uptime. Once it shut down, the only way to recover was a full restart, and, at the time, we couldn't really be restarting those devices (not even at night).
This meant that every single device would, seemingly at random, break completely: touchscreen, keyboard, modems, you name it. Everything broke. And since the modem was part of it, we would lose access to the device, which made things very hard to fix because maintenance teams were sometimes hours (& flights!) away.
It seemed to happen at random, and it was very hard to trace because we were also gearing up for an absolutely massive launch (hundreds of devices, and a couple of months later, thousands) and had pretty much every conceivable issue thrown at us: faulty USB hubs, broken modems (which would also kill the USB hub if they pulled too much power), and I'm sure a bunch of other issues I've forgotten.
Plus, since the problem took a week to manifest, we couldn't really iterate on fixes quickly: after deploying a "potential fix", we'd have to wait a whole week to see if it worked. I vividly remember the joy when I managed to get the issue to happen consistently within 2 hours instead of a week. I had no idea _why_, but at least I now had serviceable feedback loops.
Eventually, after messing with every variable we could and isolating this specific issue from the others, we figured out that it was indeed a bug in the kernel, or at least in one of its drivers: https://github.com/raspberrypi/linux/issues/5088. We had many serial ports and a pattern of opening and closing them that triggered the issue. Upgrading the kernel was impossible due to a specific vendor lock-in, and we had to fix live devices and ship hundreds more in less than a month.
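For anyone curious, a repro for that class of bug can be as blunt as a loop hammering open/close cycles on the serial ports. A minimal sketch of the idea (pure stdlib; the device paths and timing here are made up for illustration, not our actual production pattern):

    import itertools
    import os
    import time

    PORTS = ["/dev/ttyUSB0", "/dev/ttyUSB1"]  # assumed paths; yours will differ

    for cycle in itertools.count():
        for port in PORTS:
            try:
                # Each open/close cycle exercises the driver path that leaked
                fd = os.open(port, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
                os.close(fd)
            except OSError as e:
                # Once the USB stack dies, opens start failing bus-wide
                print(f"cycle {cycle}: {port}: {e}")
        time.sleep(0.01)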
In the end, we managed to build several layers on top of this unpatchable ever-growing USB-incapacitating bug: (i) we changed our serial port access patterns to significantly reduce the frequency of crashes; (ii) we adjusted boot parameters to make it much harder to trigger (aka "throw more memory at the memory leak"); (iii) we built a system that proactively detected the issue and triggered a USB reset in a very controlled fashion (this would sometimes kill the network of the device for a while, but we had no choice!); (iv) if, for some reason, all else failed, a watchdog would still reboot the system (but we really _really_ _reaaaally_ didn't want this to happen).
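To give an idea of (iii): a controlled USB reset on a Pi boils down to unbinding and rebinding the host controller driver through sysfs. A minimal sketch of that mechanism, assuming an older Pi with the dwc_otg controller (the sysfs paths and device address vary by model, the health check is simplistic, and all of it needs root):

    import glob
    import os
    import time

    DRIVER = "/sys/bus/platform/drivers/dwc_otg"  # assumption: Pi's USB host controller driver
    DEVICE = "3f980000.usb"                       # assumption: Pi 3 address; check /sys for yours

    def usb_alive():
        # Crude health check: are our USB serial devices still enumerated?
        return bool(glob.glob("/dev/ttyUSB*"))

    def reset_usb():
        # Unbind and rebind the host controller driver.
        # This drops the whole bus (modem included!) for a few seconds.
        with open(os.path.join(DRIVER, "unbind"), "w") as f:
            f.write(DEVICE)
        time.sleep(2)
        with open(os.path.join(DRIVER, "bind"), "w") as f:
            f.write(DEVICE)

    while True:
        if not usb_alive():
            reset_usb()
        time.sleep(30)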
In a way, even though these issues suck, it's when we're faced with them that we really grow. We need to grab our whole troubleshooting arsenal, do things that would otherwise feel "wrong" or "inelegant", and push through. Just thinking back to that period, I'm engulfed by a mix of gratitude for how much I learned and an uneasy sense of dread (what if next time I can't figure it out?).
Even National Instruments had this type of bug in their NI-VISA driver, which powers a good portion of the world's lab and test equipment. Every 31 days our test equipment would stop working, which happens to coincide with the overflow of one of the Windows timers. It was also one of the fastest bug-fix turnarounds I ever saw, after reporting it!
I've always been sceptical of the modern tendency to throw powerful hardware at every embedded project. In most cases a good old Atmel AVR or even an 8051 would suffice.
I think I used to have that view as well, and in a way still do, but this particular project proved otherwise.
The first version was built pretty much that way, with a tiny microcontroller and extremely optimized code. The problem was that it became very hard to iterate quickly and prototype new features. Every new piece of hardware that was added (or even just evaluated) had to be carefully integrated, and it really added to the mess. Maybe it would have been different if the code had been structured with more care from the get-go, who knows (I joined the project at version 2).
For version 2, the microcontroller was thrown out and Raspberry Pi-based solutions were brought in. Sure, it felt like carrying a shotgun to fire at a couple of flies, but having a Linux machine with such a vast ecosystem was amazing. On top of that, it was much easier to hire people to work on the project, because they could now get by with higher-level languages like Python and JavaScript. And it was much, much, much faster to develop on.
The use of the Raspberry Pi was, in my view, one of the key decisions behind what ultimately became an extremely successful product. It was much less energy-efficient, but it was very simple to develop and iterate on. In the span of months we experimented with many hardware add-ons, as product-market fit was still being worked out, and the plethora of online resources for everything else was a boon.
I'm pretty sure this was _the_ project that really made me realize that, more often than not, the right solution is the one that lets the right people make the right decisions. And for that particular team, this was, without a doubt, a remarkably successful decision. Most of the problems that typically come with it (such as bloat and inefficiency) were eventually solved, something that would not have been possible had we gone the slow, careful route from the start.
A week? I've had some Pis lose USB in 1-2 days. Fortunately we could afford to make them restart themselves every couple of hours.
I had the same experience, but I could only restart them during the night. So I wrote a monitor to check if any of the Pis lost USB before the nightly restart.
When our business grew, even with nightly restarts we would get one or two lost-USB warnings every day. One day I didn't receive any warnings. I was really happy: I had fixed the issue! Three days later a client called, screaming that the service hadn't been working for two whole days and we had done nothing. After getting every Pi restarted, I went to check the monitor. It had been shut down. I asked my business partner about it: "The alarms made me anxious, so I decided to shut down the monitor."
Obviously I sold my shares and never looked back.
Ah well. Our project was Pis in a crappy mesh network, so it lost data occasionally even when they stayed on, and continuous data wasn't that important anyway. We rebooted them every 3 or 6 hours or so.
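For reference, a blunt scheduled reboot like that can be a one-line root cron entry (the interval and file name here are just an example):

    # /etc/cron.d/pi-usb-workaround: reboot every 3 hours
    0 */3 * * * root /sbin/shutdown -r now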