Comment by jiggawatts

3 days ago

> People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.

Or… you’ve just normalised the deviation.

One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.

After about three or four weeks everyone adapts, learns what they can and can’t criticise without fallout, and settles into the mud to wallow with everyone else who has become accustomed to the filth.

As an Azure user I can tell you that it’s blindingly obvious even from the outside that the engineering quality is rock bottom. Throwing features over the fence as fast as possible to catch up to AWS was clearly the only priority for over a decade and has resulted in a giant ball of mud that now they can’t change because published APIs and offered products must continue to have support for years. Those rushed decisions have painted Azure into a corner.

You may puff your chest out, and even take legitimate pride in building the second largest public cloud in the world, but please don’t fool yourself that the quality of this edifice is anything other than rickety and falling apart at the seams.

Remind me: can I use IPv6 safely yet? Does it still break Postgres in other networks? Can azcopy actually move files yet, like every other bulk copy tool ever made by man? Can I upgrade a VM in-place to a new SKU without deleting and recreating it to work around your internal Hyper-V cluster API limitations? Premium SSDv2 disks for boot disks… when? Etc…

You may list excuses for these quality gaps, but these kinds of things just weren’t an issue anywhere else I’ve worked as far back as twenty years ago! Heck, I built a natively “all IPv6” VMware ESXi cluster over a decade ago!

> One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.

Wellllll ... my observations after many cycles of this are:

- wtfs/day exclaimed by people interacting with *a new codebase* are not indicative of anything. People first encountering the internals of any reasonably interesting system will always be baffled. In this context "wtf" might just mean "learning something new".

- wtfs/day exclaimed by people learning about your *processes and workflows* are extremely important and should be taken extremely seriously. "wtf, did you know all your junior devs are sharing a single admin API token over email?" for example.

> One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.

Eh, I don't think this is exactly as reliable as you'd expect.

My previous job had a fairly straightforward code base but fairly poor reliability for the few customers we had, and the WTF portions usually weren't the ones that caused downtime.

On the other hand, I'm currently working on a legacy system with daily WTFs from pretty much everyone and a greater degree of complexity in a number of places, and yet we get fewer bug reports despite having at least one, if not two, orders of magnitude more daily users.

With all of that said... I don't think I've used any of Microsoft's new software in years and thought to myself "this feels like it was well made."

  • The rapid decay of WTF/day over time applies to both new employees and new customers.

    > currently working on a legacy system

    "Legacy" is the magic word here! Those customers are pissed, trust me, but they've long ago given up trying to do anything about it. That's why you don't hear about it. Not because there are no bugs, but because nobody can be bothered to submit bug reports after learning long ago that doing so is futile.

    I once read a paper claiming that of all major software incidents (crash, data loss, outage, etc.), only between one in a thousand and one in ten thousand are formally reported up to an engineer capable of fixing the issue.

    I refused to believe that metric until I started collecting crash reports (and other stats) automatically on a legacy system and discovered to my horror that it was crashing multiple times per user per day, and required on average a backup restore once a week or so per user due to data corruption! We got about one support call per 4,500 such incidents.
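
    For what it's worth, the arithmetic lines up with the paper's claimed range. A rough sketch: the function name and the raw counts are made up for illustration, and only the one-call-per-4,500-incidents ratio comes from the numbers above.

```python
def reports_per_incident(support_calls: int, incidents: int) -> float:
    """Fraction of incidents that ever reach support."""
    if incidents <= 0:
        raise ValueError("incidents must be positive")
    return support_calls / incidents


# Hypothetical counts; only the 1-per-4,500 ratio is taken from the comment.
rate = reports_per_incident(support_calls=1, incidents=4500)

# Range claimed by the paper: between 1 in 10,000 and 1 in 1,000.
low, high = 1 / 10_000, 1 / 1_000
in_range = low <= rate <= high

print(f"reporting rate: {rate:.4%}, within the paper's range: {in_range}")
```

    In other words, a quiet support queue says very little about how often the thing actually falls over.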

    • The customers aren't pissed, we're doing demos to new departments and lining up customizations and expansion as quickly as we can. We're growing faster than ever within our largest customer.

      I also didn't say there are no bugs or complaints, I said the system is more stable. But yes, there are fewer bugs and complaints, especially on the critical features.

      I didn't use the word legacy to mean abandoned, just that it's been around a long time, we're maintaining it while also building newer features in newer tech, as opposed to my previous company which was a green field startup.

I mean, according to the account, the org had already decreed that everything needed to be rewritten in Rust.