Comment by GranPC
2 days ago
Considering that you were seeing unpredictable behavior in the boot selector, with it randomly freezing, I would assume a hardware component (RAM?) kicked the bucket. If it were firmware corruption, it would consistently fail to present the menu, or wouldn't boot at all.
Microsoft's code quality might not be at its peak right now, but blaming them for what's most likely a hardware fault isn't very productive IMO.
Yeah, I agree. This just feels like an appeal to anti-Microsoft clicks
From the article:
> It won’t get past the Snapdragon boot logo before rebooting or powering off… again, seemingly at random.
Random freezing at different points of the boot process suggests a hardware failure, not something broken in the software boot chain.
Or something hit its max program-erase cycle count and is returning corrupt/stale data. Flash ROMs tend to become "sticky" with previous states as you write more to them. I think it's possible that the ROMs used for early SoC boot firmware or peripheral firmware still don't have wear leveling, so they could become unusable after just a hundred or so writes.
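As a back-of-envelope illustration (the endurance rating and write count below are made-up assumptions, not specs for any real part), hammering the same un-levelled block on every update burns through a small endurance budget fast:

    # Illustrative arithmetic only; both numbers are assumptions, not real part specs.
    PE_CYCLE_ENDURANCE = 1_000   # assumed program/erase rating for a small boot-firmware flash block
    WRITES_PER_UPDATE = 4        # assumed erase/program passes that hit that same block per update

    updates_until_worn_out = PE_CYCLE_ENDURANCE // WRITES_PER_UPDATE
    print(f"With no wear leveling, that block could wear out after ~{updates_until_worn_out} updates")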
It could very well be something poorly configured in the boot chain leading to random failures. Plenty of hardware is configured by software, and misconfiguration there can produce many different kinds of seemingly random failures.
> Random freezing at different points of the boot process suggests a hardware failure, not something broken in the software boot chain.
Power issues all day long. It'll be fine until the SoC enables enough peripherals for one of the rails to sag down.
That being said, it's a hell of a coincidence that it failed exactly when a software update failed.
Firmware update changed the order in which power is enabled to peripherals?
2 replies →
> This just feels like an appeal to anti-Microsoft clicks
Exactly. Did you notice the one comment on his blog? It's a Linux zealot saying "Linux".
You're trying to take too much from that single comment sample point.
How do you know they are a zealot?
Deductive reasoning based on a sample size of one; HN comments are beyond the reach of parody.
I guess he's being a bit cheeky ;)
The update listed in the article contains two UEFI patches, intended for "handheld devices".
It would be entirely unsurprising to me if this trashed UEFI on this particular ARM device via firmware corruption.
That's plausible, but I'd expect the UEFI patches to come from a vendor, not Microsoft. So if one came from Qualcomm, and they didn't properly specify the devices it should be installed on, that wouldn't make it Microsoft's fault.
So the "hardware failure" happening exactly at the same time the Windows update installation failed is not related? That sounds like a one in a billion kind of coincidence.
An upgrade process involves heavy CPU use, disk reads/writes, and at least a few power cycles in a short time period. Depending on what OP was doing on it otherwise, it could've been the highest temperature the device had ever seen. It's not so crazy.
My guess would've been SSD failure, which would plausibly appear after lots of writes. In the olden days I used to cross my fingers when rebooting spinning disk servers with very long uptimes because it was known there was a chance they wouldn't come back up even though they were running fine.
Not for a server, but many years ago my brother had his work desktop fail after he let it cold boot for the first time in a very long time.
Normally he would leave his work machine turned on but locked when leaving the office.
Office was having electrical work done and asked that all employees unplug their machines over the weekend just in case of a surge or something.
On the Monday my brother plugged in his machine and it wouldn't turn on. Initially the IT guy remarked that my brother didn't follow the instructions to unplug it.
He later retracted the comment after it was determined the power supply capacitors had gone bad a while back, but the issue with them was not apparent until they had a chance to cool down.
> In the olden days I used to cross my fingers when rebooting spinning disk servers with very long uptimes because it was known there was a chance they wouldn't come back up even though they were running fine.
HA! Not just me then!
I still have an uneasy feeling in my gut doing reboots, especially on AM5 where the initial memory training can take 30s or so.
I think most of my "huh, it's broken now?" experiences as a youth were probably the actual install getting wonky rather than the rare "it exploded" hardware failure after a reboot, though that definitely happened too.
This, 100%.
I'd like to add my reasoning for a similar failure of an HP Proliant server I encountered.
Sometimes hardware can fail during a long uptime and not become a problem until the next reboot. Consider a piece of hardware with 100 features. During typical use, the system may only exercise 50 of those features. Imagine one of the unused features has failed. This would not cause a catastrophic failure during typical use, but on startup (which rarely occurs) that feature is necessary and the system will not boot without it. If it could get past boot, it could still perform its task, because the damaged feature is not needed afterwards. But it can't get past the boot phase, where the feature is required.
Tl;dr the system actually failed months ago and the user didn't notice because the missing feature was not needed again until the next reboot.
3 replies →
> Depending on what OP was doing on it otherwise, it could've been the highest temperature the device had ever seen. It's not so crazy.
Kind of big doubt. This was probably not slamming the hardware.
1 reply →
Over my 35 years of computer use, most hardware failures (very, very rare) happen during a reboot or power-on. And most of my reboots happen when installing updates. It actually seems like a very high probability in my limited experience.
Of course, it’s possible that the windows update was a factor, when combined with other conditions.
There's also the case where the hardware has failed but the system is already up so it just keeps running. It's when you finally go to reboot that everything falls apart in a visible manner.
6 replies →
For all we know, this thing was on its last legs (these machines do run very hot!) and the update process might have been the final nail in the coffin. That doesn't mean Microsoft set out to kill OP's machine... Same thing could have happened if OP ran make -j8 -- we wouldn't blame GNU make.
This reminds me of the 3090 hardware problems being exposed by Amazon's New World [1]. Everyone really wanted to blame the software.
https://www.pcgamer.com/amazon-new-world-killing-rtx-3090-gp...
I had a friend's dad's computer's HDD fail while I was installing Linux on it to show it to him. That was terrifying. I still remember the error, and I just left with it (and Windows) unable to boot. Later my friend told me that the drive was toast.
Come to think of it, maybe it was me. I might have trashed the MBR? I remember the error, though, "Non system disk or disk error".
IIRC, that error text comes from the MBR. You may have trashed the partition table?
1 reply →
Yeah, sounds like the drive was still physically detected but that the expected boot loader wasn't present any more.
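If anyone wants to check that case, here's a minimal sketch, assuming a Linux live environment, root access, and that the disk shows up as /dev/sda (a placeholder path): it reads sector 0 and looks for the 0x55AA boot signature and non-empty partition entries.

    # Sketch: distinguish "drive responds but boot sector is gone" from "drive is dead".
    # Assumes a Linux live environment, root, and that the disk is /dev/sda (placeholder).
    import sys

    DEVICE = "/dev/sda"  # hypothetical path; point it at the real disk

    with open(DEVICE, "rb") as disk:
        mbr = disk.read(512)  # sector 0: boot code, partition table, signature

    if len(mbr) < 512:
        sys.exit("Couldn't even read sector 0 -- that points at failing hardware.")

    signature_ok = mbr[510:512] == b"\x55\xaa"      # classic MBR boot signature
    # Four 16-byte partition entries start at offset 446; all-zero means "no partition".
    entries = [mbr[446 + i * 16 : 446 + (i + 1) * 16] for i in range(4)]
    populated = sum(1 for e in entries if any(e))

    print(f"Boot signature present: {signature_ok}")
    print(f"Populated partition entries: {populated}/4")

A readable sector 0 with a missing signature or an all-zero table points at a software-side wipe; not being able to read sector 0 at all points back at the hardware.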
If it had happened at any other time, there wouldn't be a blog post about it and we wouldn't be reading about it.
I've fixed thousands of PCs and Macs over my career. Coincidences like this happen all the time. I mean, have you seen the frequency of updates these days? There's always some kind of update happening. So the chances of your system breaking during an update are not actually that slim.
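For a rough sense of scale (every number below is a made-up assumption, just to show the odds aren't astronomical):

    # All inputs are rough assumptions, just to show the odds aren't one in a billion.
    hours_per_week_updating = 1.0           # assumed time a machine spends installing updates each week
    hours_per_week = 24 * 7
    p_mid_update = hours_per_week_updating / hours_per_week

    failing_machines_per_year = 1_000_000   # assumed machines worldwide that die of hardware faults in a year
    expected_coincidences = failing_machines_per_year * p_mid_update

    print(f"P(a random failure lands mid-update) = {p_mid_update:.2%}")
    print(f"Expected 'it died during an update' stories per year = {expected_coincidences:,.0f}")

And that's before accounting for updates stressing the machine more than idle use does.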
[dead]
I think it's fair to say they're related, yes. But causality can well be the other way around — that Windows upgrade failed because of flaky hardware.
Two bugs occurring at the same time is definitely not one in a billion, and with billions of computers in the world, weird shit is going to happen.
> That sounds like a one in a billion kind of coincidence
Hardware is more likely to fail under load than at idle.
Blaming the last thing that was happening before the hardware failed isn't a good conclusion, especially when the failure mode manifests as random startup failures instead of a predictable stop at some software stage.
A software update can absolutely trigger or unmask a hardware bug. It’s not an either/or thing, it’s usually (if a hardware issue is actually present) both in tandem:
a Windows update doing a perfectly normal write can cause the chunk of flash holding part of the boot loader to be remapped to a different, failed/failing section
This happens all the time and people always doubt it - but the pattern is consistent: large updates kill hardware that's already in the process of failing.
"Hardware failure" => "WinUpdate failure" => "Corrupted system" conforms the Occam's razor.
Like winning the lottery?
Happens quite often
I'm not so sure, I've had a similar-ish issue on a W10 PC. I vaguely suspect a race condition in one of the drivers; I've specifically got my eye on the ESP32 flashing drivers.
Sometimes it boots fine, sometimes the spinning dial disappears and it gets hung on the black screen, sometimes it hangs during the spinning dial and freezes, and very occasionally blue screens with a DPC watchdog violation. Oddly, it can happen during Safe Mode boots as well.
I would think hardware, but RAM has been replaced and all is well once it boots up. I can redline the CPU and GPU at the same time with no issues.
Sage point. But it actually turned out to be non-hardware-related (I added an "Update: It's alive!" section to the blog post).
When something works flawlessly and starts to fail right after an update (so no user action involved), the update could well have made the hardware fail. For example, excessive flash writes wearing out the SSD (this has been reported before: https://community.spiceworks.com/t/anyone-else-have-their-ss...) or reflashing a component too many times (a simple error in a script).
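One way to sanity-check the flash-wear theory is to read the drive's SMART data. Here's a minimal sketch using smartctl from smartmontools; /dev/sda is a placeholder and the attribute names vary by vendor, so treat the list below as a guess rather than a definitive set:

    # Sketch: list wear-related SMART attributes via smartctl (smartmontools).
    # /dev/sda is a placeholder device; attribute names differ between vendors.
    import subprocess

    DEVICE = "/dev/sda"  # adjust for the actual disk (e.g. /dev/nvme0 for NVMe)

    out = subprocess.run(
        ["smartctl", "-A", DEVICE],          # -A prints the device's SMART attributes
        capture_output=True, text=True, check=False,
    ).stdout

    WEAR_HINTS = ("Wear_Leveling_Count", "Media_Wearout_Indicator",
                  "Percentage Used", "Total_LBAs_Written")
    for line in out.splitlines():
        if any(hint in line for hint in WEAR_HINTS):
            print(line)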
Or it might not be related at all. Correlation rather than causation might be at play here.
Doesn't Ockham's Razor tell us to check the more obvious things first?
With the original Arduino Due there was some fun undocumented behavior with the MCU (an Atmel Cortex-M3) where it would do random things at boot unless you installed a 10k resistor. From booting off of flash or ROM at random to clocks not coming up correctly.
I swear I was doing just fine with it booting reliably until I decided to try flashing it over the SWD interface. But wouldn't you know it, soldering a resistor fixed it. Mostly.
I would test the CPU cooler, since the fans ran so hard. Temps ramp up around the login screen, then stay high, and reboots get unpredictable.
I recently had a water cooler pump die during a Windows update. The pump was going out, but the unthrottled update getting stuck on a monster CPU finished it off.