
Comment by branko_d

3 days ago

I think this is especially problematic (from Part 4 at https://isolveproblems.substack.com/p/how-microsoft-vaporize...):

"The team had reached a point where it was too risky to make any code refactoring or engineering improvements. I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something."

Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features. This can take a long time, and cannot happen without full support from management who do not fully understand the problem nor are incentivized to understand it.

This isn't incentivized in a corporate environment.

Notice how "the talent left after the launch" is mentioned in the article? Same problem. You don't get rewarded for cleaning up mess (despite lip service from management) nor for maintaining the product after the launch. Only big launches matter.

The other corporate problem is that it takes time before the cleanup produces measurable benefits, and you may well get reorged before this happens.

  • This is the root of the issue. For something like Azure, people are not fungible. You need to retain them for decades, and carefully grow the team, training new members over a long period until they can take on serious responsibilities.

    But employees are rewarded for showing quick wins and changing jobs rapidly, and employers are rewarded for getting rid of high earners (i.e. senior, long-term employees).

    • > For something like Azure, people are not fungible

      What I've learned from a decade in the industry is that talent is never fungible in low-demand areas. It's surprisingly hard to find people that "get it" and produce something worthwhile together.

      12 replies →

    • This is a human problem. We humans praise the doctors who can keep patients with terminal illnesses alive for extended periods, but ignore those who teach us how to prevent those illnesses in the first place. We throw flowers and money at doctors who treat cancer, but do we do the same for the ones who tell us how to avoid it? No.

      The same goes for MSFT or any other similar problem. Humans only care when the house is on fire; under modern capitalism that means the stock going down 50%, and only then will they have the will to make changes.

      That’s also why reforms rarely succeed, and the ones that have succeeded usually followed a huge shitstorm, when people begged for change.

      4 replies →

  • > You don't get rewarded for cleaning up mess (despite lip service from management) nor for maintaining the product after the launch

    I have never worked at a shop or on a codebase where "move fast & break things, then fix it later" ever got to the "fix it later" part. I've worked at large orgs with large old codebases where the % of effort needed for BAU / KTLO (business as usual / keeping the lights on) slowly climbs to 100%, usually through some combination of tech debt accumulation, staffing reduction, and scale/scope increases pushing the existing system to its limits.

    This is related to a worry I have about AI. I hear a lot of expectations that we're just going to increase code velocity 5x, from people who have never maintained a product before.

    So moving faster & breaking more things (accumulating more tech debt) will probably have more rapid catastrophic outcomes for products in this new phase. Then we will have some sort of Butlerian Jihad or agile v2.

    • People are still trying to figure out how to use AI. Right now the meme is it's used by juniors to churn out slop, but I think people will start to recognize it's far more powerful in the hands of competent senior devs.

      It actually surprised me that you can use AI to write even better code: tell it to write a test to catch the suspected bug, then tell it to fix the bug, then have it write documentation. Maybe also split out related functionality into a new file while you're at it.

      I might have skipped all that pre-AI, but now all of it takes 15 minutes. And the bonus is that creating more understandable code allows AI to fix even more bugs. So it could actually become a virtuous cycle: using AI to clean up debt so it can understand more code.

      In fact, right now, we're selling technical debt cleanup projects that I've been begging for for years as "we have to do this so the codebase will be more understandable by AI."

      1 reply →

  • Meanwhile, failure to clean up this particular mess was a key factor in losing a trillion dollars in market cap, according to the author.

  • It’s also a customer problem.

    In a product where a customer has to apply (or be aware of updates), it’s easier to excite them about new features instead of bug fixes.

    Especially for winning over new customers.

    If the changelog for a product’s last 5 releases are only bug fixes (or worse “refactoring” that isn’t externally visible), most will assume either development is dead or the product is horribly bug ridden - a bad look either way.

  • > This isn't incentivized in a corporate environment.

    Course it is. But only by the winners who reward the employees who do the valuable work. Microsoft has all sorts of stupid reasons why they have lots of customers - all basically proxies for their customers' IT staff being used to administrating Microsoft-based systems - but if they mess up the core reasons to use a cloud enough they will fail.

  • It's a cool talent filter though. When you're hiring, the set of people who quit doomed projects, and how fast they quit, is a really great indicator of technology-evaluation skills.

  • You do, but then you make a career out of it: you become the fixer (and it can be a very good career, either technical or managerial).

No joke, I worked at a place where, in our copy of the system headers, we had to #define near and far to nothing. That was because (despite not having supported any systems where this was applicable for more than a decade) there was a set of files considered too risky to change that still used DOS-style near and far pointers, which we had to compile for a more sane linear address space. https://www.geeksforgeeks.org/c/what-are-near-far-and-huge-p...

Now, I'm just a simple country engineer, but a sane take on risk management probably doesn't prefer de facto editing files by hijacking keywords with preprocessor magic over, you know, just making the actual change, reviewing it, and checking it in.

Once you reach this stage, the only escape is to jump ship. Either mentally or, ideally, truly.

You're in an unwinnable position. Don't take the brunt for management's mistakes. Don't try to fix what you have no agency over.

  • unfortunately, what you will find is that unless you get lucky, the next ship is more of the same.

    The system/management style is ingrained in the corporate culture of large-ish companies (I would say if it has more than 2 layers of management from you to someone owning the equity of the business and calling the shots, it's "large").

    It stems from the fact that when an executive is bestowed the responsibility of managing a company by the shareholders, the responsibility is diluted, and the principal-agent problem rears its ugly head. When several more layers of this start growing in a large company, incentives diverge, and the path of least resistance is to have zero trust in the "subordinates", lest they make a choice that is contrary to what their managers want.

    The only way to make good software is to have a small, nimble organization, where the craftsman (doing the work) makes the call, gets the rewards, and suffers the consequences (if any). That aligns principal and agent.

    • Hierarchy is the enemy of successful projects and of information flow. The more important and complex hierarchy is in a culture, the less likely it is to have a working software industry. Germany's and Japan's endless "old vs young, seniority vs new, internal vs external, company-wide management vs project-local management" come to mind. It's guerrilla vs army, startup vs company all over.

      1 reply →

    • > I would say if it has more than 2 layers of management from you to someone owning the equity of the business and calling the shots, it's "large"

      By that metric, my 50 employee company is "large".

      1 reply →

I was once in such a position. I persuaded management to first cover the entire project with an extensive test suite before touching anything. It took us around 3 months to get "good" coverage, and then we started refactoring the parts that were 100% covered. 5 months in, the shareholders got impatient and demanded "results". We were not ready yet, and in their minds we were doing nothing. No amount of explanation helped; they thought we were just adding superficial work ("the project worked before and we were shipping new features! Maybe you are just not skilled enough?"). Eventually they decided to scrap the whole thing. The project was killed and the entire team sacked.

  • I’m a developer and if a team spent five months only refactoring with zero features added I would fire you too.

    Refactoring and quality improvements must happen incrementally and in parallel with shipping new features and fixing bugs.

    • I'm a director and one of our teams just spent 8 months doing just that and it was totally justified. They're finally coming up for air and the foundation is significantly improved.

      There's nuance here. Every project/team/org is different.

> first cover everything with tests

Beware this goal. I'm dealing with the consequences of TDD taken way too far right now. Someone apparently had this same idea.

> management who do not fully understand the problem nor are incentivized to understand it

They are definitely incentivized to understand the problem. However the developers often take it upon themselves to deceive management. This happens to be their incentive. The longer they can hoodwink leadership, the longer they can pad their resume and otherwise play around in corporate Narnia.

It's amazing how far you can bullshit leaders under the pretense of how proper and cultured things like TDD are. There are compelling metrics and it has a very number-go-up feel to it. It's really easy to pervert all other aspects of the design such that they serve at the altar of TDD.

Integration testing is the only testing that matters to the customer. No one cares if your user service works flawlessly with fake everything being plugged into it. I've never seen it not come off like someone playing sim city or factorio with the codebase in the end.

  • Customers don’t care about your testing at all. They care that the product works.

    Like most things, the reality is that you need a balance. Integration tests are great for validating complex system interdependencies. They are terrible for testing code paths exhaustively. You need both integration and unit testing to properly evaluate the product. You also need monitoring, because your testing environment will never 100% match what your customers see. (If it does, your system is probably trivial, and you don't need those integration tests anyway.)

    • Integration tests (I think we call them scenario tests in our circles) also tend to test only the happy paths. There are no guarantees that your edge cases, and anything unusual such as errors from other tiers, are covered. In fact the scenario tests may just be testing mostly the same things as the unit tests, but from a different angle. The only way to be sure everything is covered is through fault injection and/or single-stepping, but it's a lost art. Relying only on automated tests gives a false sense of security.

  • Unit tests are just as important as integration tests as long as they're tightly scoped to business logic and aren't written just to improve coverage. Anything can be done badly, especially if it is quantified and used as a metric of success (Goodhart's law applies).

    Integration tests can be just as bad in this regard. They can be flaky, take hours, give you a false sense of security, and not even address the complexity of the business domain.

    I've seen people argue against unit tests because they force you to decompose your system into discrete pieces. I hope that's not the core concern here, because a well-decomposed system is easier to maintain and extend, as well as to write unit tests for.

    • The problem with unit tests these days is that AI writes them entirely, and does a great job at it. That defeats the purpose of unit tests in the first place, since the human doesn't have the patience to review the reams of over-mocked test code produced by AI.

      The end result of this is things like the Claude Code leak, presumably caused by AI-generated CI/CD packaging code that nobody bothered to review, since the attitude is: who reviews test or CI/CD code? If it breaks, big deal; AI will fix it.

> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs

The exact same approach is recommended in the book "Working Effectively with Legacy Code" by Michael Feathers, with several techniques for how to do it. He describes legacy code as 'code with no tests'.

  • "Show me the incentives, and I will show you the outcomes" - Charlie Munger

    I once worked in a shop where we had high and inflexible test coverage requirements. Developers eventually figured out that you could run a bunch of random scenarios and then `assert true` in the finally clause of the exception handler. Eventually you'd be guaranteed to cover enough to get by that gate.

    Pushing back on that practice led to a management fight about feature velocity and externally publicized deadlines.

It is so hard to test those codebases too. A lot of the time there are IO and implicit state changes throughout the code. Even getting testing in place, let alone good testing, is often an incredibly difficult task. And no one will refactor the code to make testing easier, because they're too afraid of breaking it.

> I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something.

And that, my friends, is why you want a memory safe language with as many static guarantees as possible checked automatically by the compiler.

  • Language choices won't save you here. The problem is organizational paralysis. Someone sees that the platform is unstable. They demand something be done to improve stability. The next management layer above them demands they reduce the number of changes made to improve stability.

    • Usually this results in approvals to approve the approval to approve making the change. Everyone signed off on a tower of tax forms about the change, no way it can fail now! It failed? We need another layer of approvals before changes can be made!

  • Hence the rewrite-it-in-Rust initiative, presumably. Management were aware of this problem at some level but chose a questionable solution. I don't think rewriting everything in Rust is at all compatible with their feature timelines or severe shortages of systems programming talent.

  • I had a memory management problem so I introduced GC/ref counting and now I have a non-deterministic memory management problem.

    • Ref counting is deterministic. Rust memory management is also deterministic: the memory is freed exactly when the owner of the data goes out of scope (and the borrow checker guarantees at compile time that there is no use after that).

      2 replies →

  • They could have started with simple Valgrind sessions before moving to Rust, though. A massive number of agents means microservices, and microservices are suitable for profiling/testing like that.

    • Visual Studio has had quite a bit of similar tooling, and you can have static analysis turned on all the time.

      SAL also originated with XP SP2 issues.

      Just like there have been tons of tools trying to fix C's flaws.

      However, the big issue with opt-in tooling is exactly that it is optional, and apparently Microsoft doesn't enforce it internally as much as we thought.

      4 replies →

  • I was waiting for that comment :) Remember that everybody, eventually, calls into code written in C.

    • If 90% of the code I run is in safe rust (including the part that's new and written by me, therefore most likely to introduce bugs) and 10% is in C or unsafe rust, are you saying that has no value?

      Il meglio è l'inimico del bene. Le mieux est l'ennemi du bien. Perfect is the enemy of good.

      2 replies →

    • Depends on which OS we are talking about.

      I know a few where that doesn't hold, including some still being paid for in 2026.

    • If you're sufficiently stubborn, it's certainly possible to call directly into code written in Verilog, held together with inscrutable Perl incantations.

      High-level languages like C certainly have their place, but the space seems competitive these days. Who knows where the future will lead.

      3 replies →

    • It’s worse than that. Eventually everybody calls into code that hits hardware. That is the level at which the compiler (ironically?) can no longer make guarantees. Registers change outside the scope of the currently running program all the time. Reading a register can cause other registers on a chip to change. Random chips with access to a shared memory bus can modify memory that the compiler deduced was static. There be dragons everywhere at the hardware layer, and no compiler can ever reason correctly about all of them, because, guess what, rev2 of the hardware could swap in a footprint-compatible chip clone with undocumented behavior. So even if you gave all your board information to the compiler, the program could only be verifiably correct for one potential state of one potential hardware rev.

      3 replies →

  • Did you miss the part about the "all new code is written in Rust" order coming from the top? It also failed miserably.

    • That was quite interesting and now I will take another point of view of the stuff I shared previously.

      However, given how anti-anything-not-C++ the Windows team has been, it is not surprising that it actually happened like that.

      4 replies →

Though this doesn't make much sense on its surface: a bug means something is already broken, and he tells of millions of crashes per month, so it was visibly broken. A 100% chance of being broken (the bug) > some chance of breakage from fixing it.

(sure, the value of current and potential bug isn't accounted for here, but then neither is it in "afraid to break something, do nothing")

  • I've experienced a nearly identical scenario where a large fleet of identical servers (Citrix session hosts) were crashing at a "rate" high enough that I had to "scale up" my crash dump collection scripts with automated analysis, distribution into about a hundred buckets, and then per-bucket statistical analysis of the variables. I had to compress, archive, and then simply throw away crash dumps because I had too many.

    It was pure insanity, the crashes were variously caused by things like network drivers so old and vulnerable that "drive by" network scans by malware would BSOD the servers. Alternatively, successful virus infections would BSOD the servers because the viruses were written for desktop editions of Windows and couldn't handle the differences in the server edition, so they'd just crash the system. On and on. It was a shambling zombie horde, not a server farm.

    I was made to jump through flaming hoops backwards to prove beyond a shadow of a doubt that every single individual critical Microsoft security patch a) definitely fixed one of the crash bugs and b) didn't break any apps.

    I did so! I demonstrated a 3x improvement in overall performance -- which by itself is staggering -- and that BSODs dropped by a factor of hundreds. I had pages written up on each and every patch, specifically calling out how they precisely matched a bucket of BSODs exactly. I tested the apps. I showed that some of them that were broken before suddenly started working. I did extensive UAT, etc.

    "No." was the firm answer from management.

    "Too dangerous! Something could break! You don't know what these patches could do!" etc, etc. The arguments were pure insanity, totally illogical, counter to all available evidence, and motivated only by animal fear. These people had been burned before, and they're never touching the stove again, or even going into the kitchen.

    You cannot fix an organisation like this "from below" as an IC, or even a mid-level manager. CEOs would have a hard time turning a ship like this around. Heads would have to roll, all the way up to CIO, before anything could possibly be fixed.

    • Yeah, long periods of total dysfunction get ingrained.

      Though just to ref my original point

      > burned before, and they're never touching the stove again

      Except they are sitting on the stove with their asses burning, which cuts all the needed cooling off their heads!

      1 reply →

Once you reach this stage, honestly the only escape is real escape. Put your papers in and start looking for a job elsewhere, because when they go down, they will go down hard and drag you with them. It's not like you didn't try.

> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features.

Isn't this where Oracle is with their DB? Wasn't HN complaining about that?

Or to simplify the product and rebuild.

  • “Rebuild” is also a four-letter word at this stage. The customer has a panel of knob-and-tube wiring and aluminum paper-wrapped wire in the house. They want a new hot tub. They don’t want some electrician telling them they need to completely rewire their house first at huge expense, such that they cannot afford the hot tub anymore. They’ll just throw the electrician out and get some kid in a pickup truck (“You’re Absolutely Right Handyman LLC”) to run a lamp cord to their new hot tub. Once the house burns to the ground, the new owners will wire their new construction correctly.

  • Exactly. But he’s right about management, first the problem must be acknowledged and that may make some people look bad.

writing tests and then meticulously fixing bugs does not increase shareholders' value.

  • Dave Cutler and his team are a clear counter-example. They famously shipped Windows NT with zero known bugs, which clearly brought enormous shareholder value.

    The problem, of course, is that this sort of thing doesn’t bring value next quarter.

once you reach this stage, the only escape is to give up on it and move on.

some things are beyond your control and capabilities

if the service is so shitty, why are people paying so much fucking money for it?

is microsoft committing an accounting fraud?

  • I worked at a startup that was using Azure. The reason was simple enough - it had been founded by finance people who were used to Excel, so Windows+Office was the non-negotiable first bit of IT they purchased. That created a sales channel Microsoft used to offer generous startup credits. The free money created a structural lack of discipline around spending. Once the startup credits ran out, the company became faced with a huge bill and difficulty motivating people to conserve funds.

    At the start I didn't have any strong opinion on what cloud provider to use. I did want to do IT the "old fashioned way" - rent a big ass bare metal or cloud VM, issue UNIX user accounts on it and let people do dev/test/ad hoc servers on that. Very easy to control spending that way, very easy to quickly see what's using the resources and impose limits, link programs to people, etc. I was overruled as obviously old fashioned and not getting with the cloud programme. They ended up bleeding a million dollars a month and the company wasn't even running a SaaS!

    I ended up with a very low opinion of Azure. Basic things like TCP connections between VMs would mysteriously hang. We got MS to investigate, they made a token effort and basically just admitted defeat. I raged that this was absurd as working TCP is table stakes for literally any datacenter since the 1980s, but - sad to say - at this time Azure's bad behavior was enabled by a widespread culture of CV farming in which "enterprise" devs were all obsessed with getting cloud tech onto their LinkedIn. Any time we hit bugs or stupidities in the way Azure worked I was told the problem was clearly with the software I'd written, which couldn't be "cloud native", as if it was it'd obviously work fine in Azure!

    With attitudes like that completely endemic outside of the tech sector, of course Microsoft learned not to prioritize quality.

    We did eventually diversify a bit. We needed to benchmark our server software reliably and that was impossible in Azure because it was so overloaded and full of noisy neighbours, so we rented bare metal servers in OVH to do that. It worked OK.

    • "Basic things like TCP connections between VMs would mysteriously hang"

      This is like a car that can't even get you two blocks from home. Amazing.

    • I have had bad experiences across all major vendors.

      The main reason I used to push for Azure instead during the last years was the friendliness of their Web UIs, and having the VS Code integration (it started as an Azure product after all).

      2 replies →

  • Corporate inertia. Sibling comment uses the term "hostage situation" which I admit is pretty apt.

    Microsoft is an approved vendor in every large enterprise. That they have been approved for desktop productivity, SharePoint, email and on-prem systems does not enter the picture. That would be too nuanced.

    Dealing with a Large Enterprise[tm] is an exercise in frustration. A particular client had to be deployed to Azure because their estimate was that getting a new cloud vendor approved for production deployments would be a gargantuan 18-to-24 month org-wide and politically fraught process.

    If you are a large corp and have to move workloads to the cloud (because let's be honest: maintaining your own data centres and hardware procurement pipelines is a serious drag) then you go with whatever vendor your organisation has approved. And if the only pre-approved vendor with a cloud offering is Microsoft, you use Azure.

  • Because Azure customers are companies that still, in 2026, only use Windows. Anyone else uses something else. Turns out, companies like that don't tend to have the best engineering teams. So moving an entire cloud infrastructure from Azure to, say, AWS is probably either really expensive, really risky or too disruptive for the type of engineering team that Azure customers have. I would expect MS to bleed from this slowly for a long time until they actually fix it. I seriously doubt they ever will, but stranger things have happened.

    • Turns out that outside of companies shipping software products and aspiring to be the next Google or Apple, most companies outside the software industry also need software to run their business, and they couldn't care less about HN technology cool factor.

      They use whatever they can to ship their products into trucks, outsourcing their IT and development costs, and that is about it.

      1 reply →

  • I have worked at two retail companies where AWS was a no-no. They didn't want anything depending on a competitor (Amazon). So they went the Azure route.

  • CFOs love it because Microsoft does bundle pricing with office. Plus they love to give large credits to bootstrap lock-in.

  • You’re assuming the alternatives don’t have just as many issues. There’s been exactly one “whistleblower”, who is probably tiptoeing the line of a lawsuit. Just because there isn’t a similar disgruntled GCP or AWS engineer doesn’t mean they don’t have similar problems.

  • this made me look into how cloud hypervisors actually work at the HW level... they all offload it to custom HW (smart NICs, FPGAs, DPUs, etc.). The CPU does almost nothing except tenant work. AWS -> Nitro, Azure -> FPGA, NVIDIA sells DPUs.

    Here is an interactive visual guide if anyone wants to explore - https://vectree.io/c/cloud-virtualization-hardware-nitro-cat...

    • VM management does not run on the FPGA; it’s regular Win32 software on Windows, with aspirations to run some equivalent, someday, on the SoC next to the FPGA on the NIC. The programmable hardware is used for network paths and PCIe functions, where it can project NICs and NVMe devices into VMs to bypass software-based, VMBus-backed virtual devices, all of which end up being serviced on the host that controls the real hardware. Look up SR-IOV for the bypass. So yes, that’s I/O bypass/offload, but the VM management stack offload is a distinct thing that does not require an FPGA, just a SoC.

  • Yeah it’s entirely business people and executives who make these decisions in most companies. Not the ones who use it or implement on it.

  • Depending on the space you work in, you have almost no choice at all. If you're building for government then you're going to use Microsoft, almost "end of story".

  • most of the upper management at the companies who use them don't have the technical competence to see it (e.g. banks, supermarket chains, manufacturing companies).

    once they are in, no one likes to admit they made a mistake.

  • Because the alternatives are also in a similar state.

    AWS or GCP are all pretty crap. Use any of them, and you'll hit plenty of rough edges. The whole industry is just grinding out slop; quality is not important anywhere.

    I work with AWS on a daily basis, and I'm not really impressed. (Nor did GCP impress me in the short encounter I had with it.)

    • I don't know about AWS or the rest of GCP, but in terms of engineering, my experience of GCE was at least an entire order of magnitude better than what the article alleges about Azure. Security and reliability were taken extremely seriously, and the quality of the engineering was world-class. I hope it has stayed like this since then. It was a worthwhile thing to experience.

    • This isn't it at all. AWS does not have the same sorts of insane cross-tenancy exploits that Azure has had, for example.

      The reason that Azure has so many customers is very simply because Azure is borderline mandated by the US government.