
Comment by branko_d

3 days ago

I think this is especially problematic (from Part 4 at https://isolveproblems.substack.com/p/how-microsoft-vaporize...):

"The team had reached a point where it was too risky to make any code refactoring or engineering improvements. I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something."

Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features. This can take a long time, and cannot happen without full support from management who do not fully understand the problem nor are incentivized to understand it.

This isn't incentivized in a corporate environment.

Notice how "the talent left after the launch" is mentioned in the article? Same problem. You don't get rewarded for cleaning up mess (despite lip service from management) nor for maintaining the product after the launch. Only big launches matter.

The other corporate problem is that it takes time before the cleanup produces measurable benefits, and you may well get reorged before this happens.

  • This is the root of the issue. For something like Azure, people are not fungible. You need to retain them for decades, and carefully grow the team, training new members over a long period until they can take on serious responsibilities.

    But employees are rewarded for showing quick wins and changing jobs rapidly, and employers are rewarded for getting rid of high earners (i.e. senior, long-term employees).

    • > For something like Azure, people are not fungible

      What I've learned from a decade in the industry is that talent is never fungible in low-demand areas. It's surprisingly hard to find people that "get it" and produce something worthwhile together.

      12 replies →

    • This is a human problem. We humans praise the doctors who can keep patients with terminal illnesses alive for extended periods, but ignore those who teach us how to prevent those illnesses in the first place. We throw flowers and money at doctors who treat cancer, but do we do the same for the ones who tell us how to avoid it? No.

      The same goes for MSFT or any other similar problem. Humans only care when the house is on fire; under modern capitalism that means the stock going down 50%, and only then will they have the will to make changes.

      That’s also why reforms rarely succeed, and the ones that have succeeded usually followed a huge shitstorm, when people begged for change.

      4 replies →

  • > You don't get rewarded for cleaning up mess (despite lip service from management) nor for maintaining the product after the launch

    I have never worked at a shop or on a codebase where "move fast & break things, then fix it later" ever got to the "fix it later" part. I've worked at large orgs with large old codebases where the % of effort needed for BAU / KTLO (business as usual / keeping the lights on) slowly climbs to 100%, usually through some combination of tech debt accumulation, staffing reduction, and scale/scope increases pushing the existing system to its limits.

    This is related to a worry I have about AI. I hear a lot of expectations that we're just going to increase code velocity 5x, from people who have never maintained a product before.

    So moving faster & breaking more things (accumulating more tech debt) will probably have more rapid catastrophic outcomes for products in this new phase. Then we will have some sort of Butlerian Jihad or agile v2.

    • People are still trying to figure out how to use AI. Right now the meme is it's used by juniors to churn out slop, but I think people will start to recognize it's far more powerful in the hands of competent senior devs.

      It actually surprised me that you can use AI to write even better code: tell it to write a test to catch the suspected bug, then tell it to fix the bug, then have it write documentation. Maybe also split out related functionality into a new file while you're at it.

      I might have skipped all that pre-AI, but now all of it takes 15 minutes. And the bonus is that creating more understandable code allows AI to fix even more bugs. So it could actually become a virtuous cycle: using AI to clean up debt so it can understand more code.

      In fact, right now, we're selling technical debt cleanup projects that I've been begging for for years as "we have to do this so the codebase will be more understandable by AI."

      1 reply →

  • Meanwhile, failure to clean up this particular mess was a key factor in losing a trillion dollars in market cap, according to the author.

  • It’s also a customer problem.

    In a product where a customer has to apply (or be aware of updates), it’s easier to excite them about new features instead of bug fixes.

    Especially for winning over new customers.

    If the changelog for a product’s last 5 releases are only bug fixes (or worse “refactoring” that isn’t externally visible), most will assume either development is dead or the product is horribly bug ridden - a bad look either way.

  • > This isn't incentivized in a corporate environment.

    Course it is. But only by the winners who reward the employees who do the valuable work. Microsoft has all sorts of stupid reasons why they have lots of customers - all basically proxies for their customers' IT staff being used to administrating Microsoft-based systems - but if they mess up the core reasons to use a cloud enough they will fail.

  • It's a cool talent filter though. When you're hiring, the set of people who quit doomed projects, and how fast they quit, is a really great indicator of technology-evaluation skills.

  • You do, but then you make a career out of it: you become the fixer (and it can be a very good career, either technical or managerial).

No joke, I worked at a place where, in our copy of the system headers, we had to #define near and far to nothing. That was because (despite not having supported any systems where this was applicable for more than a decade) there was a set of files considered too risky to change that still used DOS-style near and far pointers, which we had to compile for a more sane linear address space. https://www.geeksforgeeks.org/c/what-are-near-far-and-huge-p...

Now, I'm just a simple country engineer, but a sane take on risk management probably doesn't prefer de facto editing files by hijacking keywords with preprocessor magic over, you know, just making the actual change, reviewing it, and checking it in.

Once you reach this stage, the only escape is to jump ship. Either mentally or, ideally, truly.

You're in an unwinnable position. Don't take the brunt for management's mistakes. Don't try to fix what you have no agency over.

  • unfortunately, what you will find is that unless you get lucky, the next ship is more of the same.

    The system/management style is ingrained in the corporate culture of large-ish companies (I would say if it has more than 2 layers of management from you to someone owning the equity of the business and calling the shots, it's "large").

    It stems from the fact that when an executive is bestowed the responsibility of managing a company by the shareholders, the responsibility is diluted, and the principal-agent problem rears its ugly head. When several more layers of this start growing in a large company, incentives diverge, and the path of least resistance is to have zero trust in the "subordinates", lest they make a choice that is contrary to what their managers want.

    The only way to make good software is to have a small, nimble organization, where the craftsman (doing the work) makes the call, gets the rewards, and suffers the consequences (if any). That aligns principal and agent.

    • Hierarchy is the enemy of successful projects and of information flow. The more important and complex hierarchy is in a culture, the less likely it is to have a working software industry. Germany's and Japan's endless "old vs young, seniority vs new, internal vs external, company-wide management vs project-local management" come to mind. It's guerrilla vs army, startup vs company all over.

      1 reply →

    • > I would say if it has more than 2 layers of management from you to someone owning the equity of the business and calling the shots, it's "large"

      By that metric, my 50 employee company is "large".

      1 reply →

I was once in such a position. I persuaded management to first cover the entire project with an extensive test suite before touching anything. It took us around 3 months to get "good" coverage, and then we started refactoring the parts that were 100% covered. 5 months in, the shareholders got impatient and demanded "results". We were not ready yet, and in their minds we were doing nothing. No amount of explanation helped; they thought we were just adding superficial work ("the project worked before and we were shipping new features! Maybe you are just not skilled enough?"). Eventually they decided to scrap the whole thing. The project was killed and the entire team sacked.

  • I’m a developer and if a team spent five months only refactoring with zero features added I would fire you too.

    Refactoring and quality improvements must happen incrementally and in parallel with shipping new features and fixing bugs.

    • I'm a director and one of our teams just spent 8 months doing just that and it was totally justified. They're finally coming up for air and the foundation is significantly improved.

      There's nuance here. Every project/team/org is different.

> first cover everything with tests

Beware this goal. I'm dealing with the consequences of TDD taken way too far right now. Someone apparently had this same idea.

> management who do not fully understand the problem nor are incentivized to understand it

They are definitely incentivized to understand the problem. However the developers often take it upon themselves to deceive management. This happens to be their incentive. The longer they can hoodwink leadership, the longer they can pad their resume and otherwise play around in corporate Narnia.

It's amazing how far you can bullshit leaders under the pretense of how proper and cultured things like TDD are. There are compelling metrics and it has a very number-go-up feel to it. It's really easy to pervert all other aspects of the design such that they serve at the altar of TDD.

Integration testing is the only testing that matters to the customer. No one cares if your user service works flawlessly with fake everything being plugged into it. I've never seen it not come off like someone playing sim city or factorio with the codebase in the end.

  • Customers don’t care about your testing at all. They care that the product works.

    Like most things, the reality is that you need a balance. Integration tests are great for validating complex system interdependencies. They are terrible for testing code paths exhaustively. You need both integration and unit testing to properly evaluate the product. You also need monitoring, because your testing environment will never 100% match what your customers see. (If it does, your system is probably trivial, and you don't need those integration tests anyway.)

    • Integration tests (I think we call them scenario tests in our circles) also tend to test only the happy paths. There are no guarantees that your edge cases, and anything unusual such as errors from other tiers, are covered. In fact the scenario tests may just be testing mostly the same things as the unit tests, but from a different angle. The only way to be sure everything is covered is through fault injection and/or single-stepping, but it's a lost art. Relying only on automated tests gives a false sense of security.

  • Unit tests are just as important as integration tests as long as they're tightly scoped to business logic and aren't written just to improve coverage. Anything can be done badly, especially if it is quantified and used as a metric of success (Goodhart's law applies).

    Integration tests can be just as bad in this regard. They can be flaky, take hours, give you a false sense of security, and not even address the complexity of the business domain.

    I've seen people argue against unit tests because they force you to decompose your system into discrete pieces. I hope that's not the core concern here, because a well-decomposed system is easier to maintain and extend, as well as to write unit tests for.

    • The problem with unit tests these days is that AI writes them entirely, and does a great job at it. That defeats the purpose of unit tests in the first place, since the human doesn't have the patience to review the reams of over-mocked test code produced by AI.

      The end result of this is things like the Claude Code leak, presumably caused by AI-generated CI/CD packaging code that nobody bothered to review, since the attitude is: who reviews test or CI/CD code? If it breaks, big deal; AI will fix it.

> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs

The exact same approach is recommended in the book "Working Effectively with Legacy Code" by Michael Feathers, with several techniques for how to do it. He describes legacy code as 'code with no tests'.

  • "Show me the incentives, and I will show you the outcomes" - Charlie Munger

    I once worked in a shop where we had high and inflexible test coverage requirements. Developers eventually figured out that you could run a bunch of random scenarios and then `assert true` in the finally clause of the exception handler. Eventually you'd be guaranteed to cover enough to get by that gate.

    Pushing back on that practice led to a management fight about feature velocity and externally publicized deadlines.

It is so hard to test those codebases too. A lot of the time there are IO and implicit state changes throughout the code. Even getting testing in place, let alone good testing, is often an incredibly difficult task. And no one will refactor the code to make testing easier, because they're too afraid of breaking it.

> I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something.

And that, my friends, is why you want a memory safe language with as many static guarantees as possible checked automatically by the compiler.

  • Language choices won't save you here. The problem is organizational paralysis. Someone sees that the platform is unstable. They demand something be done to improve stability. The next management layer above them demands they reduce the number of changes made to improve stability.

    • Usually this results in approvals to approve the approval to approve making the change. Everyone signed off on a tower of tax forms about the change, no way it can fail now! It failed? We need another layer of approvals before changes can be made!

  • Hence the rewrite-it-in-Rust initiative, presumably. Management were aware of this problem at some level but chose a questionable solution. I don't think rewriting everything in Rust is at all compatible with their feature timelines or severe shortages of systems programming talent.

  • I had a memory management problem so I introduced GC/ref counting and now I have a non-deterministic memory management problem.

    • Ref counting is deterministic. Rust memory management is also deterministic: the memory is freed exactly when the owner of the data goes out of scope (and the borrow checker guarantees at compile time that there is no use after that).

      2 replies →

  • They could have started with simple Valgrind sessions before moving to Rust, though. A massive number of agents means microservices, and microservices are suitable for profiling/testing like that.

    • Visual Studio has had quite a bit of similar tooling, and you can have static analysis turned on all the time.

      SAL also originated with XP SP2 issues.

      Just like there have been tons of tools trying to fix C's flaws.

      However, the big issue with opt-in tooling is exactly that it is optional, and apparently Microsoft doesn't enforce it internally as much as we thought.

      4 replies →

  • I was waiting for that comment :) Remember that everybody, eventually, calls into code written in C.

    • If 90% of the code I run is in safe rust (including the part that's new and written by me, therefore most likely to introduce bugs) and 10% is in C or unsafe rust, are you saying that has no value?

      Il meglio è l'inimico del bene. Le mieux est l'ennemi du bien. Perfect is the enemy of good.

      2 replies →

    • Depends on which OS we are talking about.

      I know a few where that doesn't hold, including some still being paid for in 2026.

    • If you're sufficiently stubborn, it's certainly possible to call directly into code written in Verilog, held together with inscrutable Perl incantations.

      High-level languages like C certainly have their place, but the space seems competitive these days. Who knows where the future will lead.

      3 replies →

    • It’s worse than that. Eventually everybody calls into code that hits hardware. That is the level at which the compiler (ironically?) can no longer make guarantees. Registers change outside the scope of the currently running program all the time. Reading a register can cause other registers on a chip to change. Random chips with access to a shared memory bus can modify memory that the compiler deduced was static. There be dragons everywhere at the hardware layer, and no compiler can ever reason correctly about all of them, because, guess what, rev2 of the hardware could swap in a footprint-compatible chip clone with undocumented behavior. So even if you gave all your board information to the compiler, the program could only be verifiably correct for one potential state of one potential hardware rev.

      3 replies →

  • Did you miss the part about the "all new code is written in Rust" order coming from the top? It also failed miserably.

    • That was quite interesting and now I will take another point of view of the stuff I shared previously.

      However, given how anti-anything-not-C++ the Windows team has been, it is not surprising that it actually happened like that.

      4 replies →

Though this doesn't make much sense on its surface: a bug means something is already broken, and he tells of millions of crashes per month, so it was visibly broken. A 100% chance of being broken (the bug) > some chance of breakage from fixing it.

(sure, the value of current and potential bug isn't accounted for here, but then neither is it in "afraid to break something, do nothing")

  • I've experienced a nearly identical scenario where a large fleet of identical servers (Citrix session hosts) were crashing at a "rate" high enough that I had to "scale up" my crash dump collection scripts with automated analysis, distribution into about a hundred buckets, and then per-bucket statistical analysis of the variables. I had to compress, archive, and then simply throw away crash dumps because I had too many.

    It was pure insanity, the crashes were variously caused by things like network drivers so old and vulnerable that "drive by" network scans by malware would BSOD the servers. Alternatively, successful virus infections would BSOD the servers because the viruses were written for desktop editions of Windows and couldn't handle the differences in the server edition, so they'd just crash the system. On and on. It was a shambling zombie horde, not a server farm.

    I was made to jump through flaming hoops backwards to prove beyond a shadow of a doubt that every single individual critical Microsoft security patch a) definitely fixed one of the crash bugs and b) didn't break any apps.

    I did so! I demonstrated a 3x improvement in overall performance -- which by itself is staggering -- and that BSODs dropped by a factor of hundreds. I had pages written up on each and every patch, specifically calling out how they precisely matched a bucket of BSODs exactly. I tested the apps. I showed that some of them that were broken before suddenly started working. I did extensive UAT, etc.

    "No." was the firm answer from management.

    "Too dangerous! Something could break! You don't know what these patches could do!" etc, etc. The arguments were pure insanity, totally illogical, counter to all available evidence, and motivated only by animal fear. These people had been burned before, and they're never touching the stove again, or even going into the kitchen.

    You cannot fix an organisation like this "from below" as an IC, or even a mid-level manager. CEOs would have a hard time turning a ship like this around. Heads would have to roll, all the way up to CIO, before anything could possibly be fixed.

    • Yeah, long periods of total dysfunction get ingrained.

      Though just to ref my original point

      > burned before, and they're never touching the stove again

      Except they are sitting on the stove with their asses burning, which cuts all the needed cooling off their heads!

      1 reply →

Once you reach this stage, honestly the only escape is real escape. Put your papers in and start looking for a job elsewhere, because when they go down, they will go down hard and drag you with them. It's not like you didn't try.

> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features.

Isn't this where Oracle is with their DB? Wasn't HN complaining about that?

Or to simplify the product and rebuild.

  • “Rebuild” is also a four-letter word at this stage. The customer has a panel of knob-and-tube wiring and aluminum paper-wrapped wire in the house. They want a new hot tub. They don’t want some electrician telling them they need to completely rewire their house first at huge expense, such that they cannot afford the hot tub anymore. They’ll just throw the electrician out and get some kid in a pickup truck (“You’re Absolutely Right Handyman LLC”) to run a lamp cord to their new hot tub. Once the house burns to the ground, the new owners will wire their new construction correctly.

  • Exactly. But he’s right about management, first the problem must be acknowledged and that may make some people look bad.

writing tests and then meticulously fixing bugs does not increase shareholders' value.

  • Dave Cutler and his team are a clear counter-example. They famously shipped Windows NT with zero known bugs, which clearly brought enormous shareholder value.

    The problem, of course, is that this sort of thing doesn’t bring value next quarter.

once you reach this stage, the only escape is to give up on it and move on.

some things are beyond your control and capabilities

if the service is so shitty, why are people paying so much fucking money for it?

is microsoft committing an accounting fraud?

  • I worked at a startup that was using Azure. The reason was simple enough - it had been founded by finance people who were used to Excel, so Windows+Office was the non-negotiable first bit of IT they purchased. That created a sales channel Microsoft used to offer generous startup credits. The free money created a structural lack of discipline around spending. Once the startup credits ran out, the company became faced with a huge bill and difficulty motivating people to conserve funds.

    At the start I didn't have any strong opinion on what cloud provider to use. I did want to do IT the "old fashioned way" - rent a big ass bare metal or cloud VM, issue UNIX user accounts on it and let people do dev/test/ad hoc servers on that. Very easy to control spending that way, very easy to quickly see what's using the resources and impose limits, link programs to people, etc. I was overruled as obviously old fashioned and not getting with the cloud programme. They ended up bleeding a million dollars a month and the company wasn't even running a SaaS!

    I ended up with a very low opinion of Azure. Basic things like TCP connections between VMs would mysteriously hang. We got MS to investigate, they made a token effort and basically just admitted defeat. I raged that this was absurd as working TCP is table stakes for literally any datacenter since the 1980s, but - sad to say - at this time Azure's bad behavior was enabled by a widespread culture of CV farming in which "enterprise" devs were all obsessed with getting cloud tech onto their LinkedIn. Any time we hit bugs or stupidities in the way Azure worked I was told the problem was clearly with the software I'd written, which couldn't be "cloud native", as if it was it'd obviously work fine in Azure!

    With attitudes like that completely endemic outside of the tech sector, of course Microsoft learned not to prioritize quality.

    We did eventually diversify a bit. We needed to benchmark our server software reliably and that was impossible in Azure because it was so overloaded and full of noisy neighbours, so we rented bare metal servers in OVH to do that. It worked OK.

    • "Basic things like TCP connections between VMs would mysteriously hang"

      This is like a car that can't even get you two blocks from home. Amazing.

    • I have had bad experiences across all major vendors.

      The main reason I used to push for Azure instead during the last years was the friendliness of their Web UIs, and having the VS Code integration (it started as an Azure product after all).

      2 replies →

  • Corporate inertia. Sibling comment uses the term "hostage situation" which I admit is pretty apt.

    Microsoft is an approved vendor in every large enterprise. That they have been approved for desktop productivity, SharePoint, email and on-prem systems does not enter the picture. That would be too nuanced.

    Dealing with a Large Enterprise[tm] is an exercise in frustration. A particular client had to be deployed to Azure because their estimate was that getting a new cloud vendor approved for production deployments would be a gargantuan 18-to-24 month org-wide and politically fraught process.

    If you are a large corp and have to move workloads to the cloud (because let's be honest: maintaining your own data centres and hardware procurement pipelines is a serious drag) then you go with whatever vendor your organisation has approved. And if the only pre-approved vendor with a cloud offering is Microsoft, you use Azure.

  • Because Azure customers are companies that still, in 2026, only use Windows. Anyone else uses something else. Turns out, companies like that don't tend to have the best engineering teams. So moving an entire cloud infrastructure from Azure to, say, AWS is probably either really expensive, really risky or too disruptive for the type of engineering team that Azure customers have. I would expect MS to bleed from this slowly for a long time until they actually fix it. I seriously doubt they ever will, but stranger things have happened.

    • Turns out that outside of companies shipping software products and aspiring to be the next Google or Apple, most companies outside the software industry also need software to run their business, and they couldn't care less about HN technology cool factor.

      They use whatever they can to ship their products into trucks, outsourcing their IT and development costs, and that is about it.

      1 reply →

  • I have worked at two retail companies where AWS was a no-no. They didn't want anything depending on a competitor (Amazon). So they went the Azure route.

  • CFOs love it because Microsoft does bundle pricing with office. Plus they love to give large credits to bootstrap lock-in.

  • You’re assuming the alternatives don’t have just as many issues. There’s been exactly one “whistleblower”, who is probably tiptoeing the line of a lawsuit. Just because there isn’t a similar disgruntled GCP or AWS engineer doesn’t mean they don’t have similar problems.

  • this made me look into how cloud hypervisors actually work at the HW level... they all offload it to custom HW (smart NICs, FPGAs, DPUs, etc.). The CPU does almost nothing except tenant work. AWS -> Nitro, Azure -> FPGA, NVIDIA sells DPUs.

    Here is an interactive visual guide if anyone wants to explore - https://vectree.io/c/cloud-virtualization-hardware-nitro-cat...

    • VM management does not run on the FPGA; it’s regular Win32 software on Windows, with aspirations to run some equivalent, someday, on the SoC next to the FPGA on the NIC. The programmable hardware is used for network paths and PCIe functions, where it can project NICs and NVMe devices into VMs to bypass software-based, VMBus-backed virtual devices, all of which end up being serviced on the host that controls the real hardware. Look up SR-IOV for the bypass. So yes, that’s I/O bypass/offload, but the VM management stack offload is a distinct thing that does not require an FPGA, just a SoC.

  • Yeah it’s entirely business people and executives who make these decisions in most companies. Not the ones who use it or implement on it.

  • Depending on the space you work in, you have almost no choice at all. If you're building for government then you're going to use Microsoft, almost "end of story".

  • most of the upper management at the companies who use them don't have the technical competence to see it (e.g. banks, supermarket chains, manufacturing companies).

    once they are in, no one likes to admit they made a mistake.

  • Because the alternatives are also in a similar state.

    AWS or GCP are all pretty crap. Use any of them, and you'll hit plenty of rough edges. The whole industry is just grinding out slop; quality is not important anywhere.

    I work with AWS on a daily basis, and I'm not really impressed. (Nor did GCP impress me in the short encounter I had with it.)

    • I don't know about AWS or the rest of GCP, but in terms of engineering, my experience of GCE was at least an entire order of magnitude better than what the article alleges about Azure. Security and reliability were taken extremely seriously, and the quality of the engineering was world-class. I hope it has stayed like this since then. It was a worthwhile thing to experience.

    • This isn't it at all. AWS does not have the same sorts of insane cross-tenancy exploits that Azure has had, for example.

      The reason that Azure has so many customers is very simply because Azure is borderline mandated by the US government.