Comment by w10-1
2 months ago
Kudos to Cloudflare for clarity and diligence.
When talking of their earlier Lua code:
> we have never before applied a killswitch to a rule with an action of “execute”.
I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?
It tracks what I've seen elsewhere: quality engineering can't keep up with production engineering. It's just that I think of Cloudflare as an infrastructure place, where that shouldn't be true.
I had a manager who came from defense electronics in the 1980s. He said that in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.
"Kudos"? This is like the South Park episode in which the oil company guy just excuses himself while the company just continues to fuck up over and over again. There's nothing to praise, this shouldn't happen twice in a month. Its inexcusable.
twice in a month _so far_
We still have two holidays and associated vacations and vacation brain to go. And then the January hangover.
Every company that has ignored my following advice has experienced a day-for-day slip in first-quarter scheduling. And that advice is: not much work gets done between Dec 15 and Jan 15. You can rely on a week's worth; more than that is optimistic. People are taking it easy, and they need to verify things with someone who is on vacation, so they are blocked. And when that person gets back, it's two days until their own vacation, so it's a crapshoot.
NB: there’s work happening on Jan 10, for certain, but it’s not getting finished until the 15th. People are often still cleaning up after bad decisions they made during the holidays and the subsequent hangover.
Those AI agents are coding fast, or am I missing some obvious concept here?
reaching for that _one 9 of uptime_
It's weird reading these reports because they don't seem to test anything at all (or at least there's very little mention of testing).
Canary deployment, testing environments, unit tests, integration tests, anything really?
It sounds like they test by merging directly to production, but surely they don't.
The problem is that Cloudflare do incremental rollouts and loads of testing for _code_. But they don't do the same thing for configuration - they globally push out changes because they want rapid response.
It's still a bit silly though; their claimed reasoning probably doesn't stack up for most of their config changes. I don't see it as likely that a 0.1% -> 1% -> 10% -> 100% rollout over the course of 10 minutes would be catastrophically bad for _most_ changes.
And to their credit, it does seem they want to change that.
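For illustration, the staged rollout described above isn't much machinery. A minimal sketch, with made-up names (`apply_to_fraction`, `error_rate`, and `rollback` are hypothetical callables, not Cloudflare's actual tooling):

```python
import time

# Hypothetical sketch of a staged config rollout. None of these names
# correspond to Cloudflare's real systems; the callables are assumed
# to be supplied by the deployment machinery.

STAGES = [0.001, 0.01, 0.10, 1.0]  # 0.1% -> 1% -> 10% -> 100% of traffic
BAKE_SECONDS = 150                 # ~10 minutes total across four stages
ERROR_BUDGET = 0.02                # abort if the error rate exceeds 2%

def rollout(config, apply_to_fraction, error_rate, rollback):
    """Progressively apply `config`, rolling back if errors spike."""
    for fraction in STAGES:
        apply_to_fraction(config, fraction)  # push to this slice of the fleet
        time.sleep(BAKE_SECONDS)             # let metrics accumulate
        if error_rate() > ERROR_BUDGET:
            rollback(config)
            raise RuntimeError(f"rollout aborted at {fraction:.1%}")
        # healthy: widen the blast radius and repeat
```

The whole point is the bake time between stages: a bad config gets caught while it's still only hurting a sliver of traffic.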
Yeah, to me it doesn't make any sense. Configuration changes are just as likely to break stuff (as they've discovered the hard way), and both of these issues could have been found in a testing environment before being deployed to production.
In the post they described observing errors in their testing env, but decided to ignore them because they were rolling out a security fix. I am sure there is more nuance to this, but I don't know whether that makes it better or worse.
> but decided to ignore them because they were rolling out a security fix.
A key part of secure systems is availability...
It really looks like vibe-coding.
This is funny, considering that someone who worked in the defense industry (on a guided missile system) found a memory leak in one of their products back then. They told him that they knew about it, but that it was timed just right for the range the system would be used at, so it didn't matter.
This paraphrased urban legend has nothing to do with quality engineering though? As described, it's designed to the spec and working as intended.
It tracks with my experience in software quality engineering. Asked to find problems with something already working well in the field. Dutifully find bugs, etc. Get told that it's working, though, so nobody will change anything. In dysfunctional companies, which is probably most of them, quality engineering exists to cover asses, not to actually guide development.
Having observed an average of two management rotations at most of the clients our company works for, this comes as absolutely no surprise to me. Engineering is acting perfectly reasonably, optimizing for cost and time within the constraints they were given. Then the constraints are updated on a whim (to please marketing or investors) without consulting engineering; cue disaster. Not even surprising to me anymore...
... until the extended-range version is ordered and no one remembers to fix the leak. :]
Ariane 5 happens.
They will remember, because it'll have been measured and documented, rigorously.
My hunch is that we do the same with memory leaks and other bugs in web applications, where the lifetime of a request is short.
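That hunch tracks with a common mitigation: rather than fixing a slow per-request leak, bound the lifetime of the leaking process. A rough sketch of the idea; the handler and loop below are invented for illustration, though the recycling trick itself is real (it's what e.g. gunicorn's `max_requests` setting does):

```python
import os

# Hypothetical sketch: bound the lifetime of a leaking worker instead of
# fixing the leak. handle_request/worker_loop are made up for this example.

MAX_REQUESTS = 1000  # recycle the worker before the leak ever matters
_leaked = []         # stands in for whatever the real bug accumulates

def handle_request(request):
    _leaked.append(bytearray(4096))  # ~4 KiB leaked per request
    return b"ok"

def worker_loop(next_request):
    for _ in range(MAX_REQUESTS):
        handle_request(next_request())
    # Exit after N requests; a supervisor forks a fresh worker, and the
    # OS reclaims all the leaked memory on process exit.
    os._exit(0)
```

The leak is real, but like the missile, the process never lives long enough for it to matter, right up until someone raises the request cap without re-measuring.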