Comment by solatic

4 hours ago

I think there are two kinds of software-producing organizations:

There are the small shops, where you're running some kind of monolith, generally open to the Internet, maybe with a database hooked up to it. These shops do not need dedicated DevOps/SRE. Throw it into a container platform (e.g. AWS ECS/Fargate, GCP Cloud Run, fly.io; the market is broad enough that it's basically getting commoditized), hook up observability/alerting, and maybe pay a consultant to review it and make sure you didn't do anything stupid. Then just pay the bill every month, and don't over-think it.
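
Concretely, category 1 can be as boring as a single container like this rough sketch (Python; /healthz and the JSON-logs-to-stdout convention are just illustrative choices, and PORT is the env var a platform like Cloud Run injects):

    # Rough sketch, not a real product: a monolith shaped so a managed container
    # platform can run it with zero dedicated ops work. Assumes the platform
    # injects PORT (Cloud Run does) and scrapes stdout for logs; /healthz is just
    # an illustrative health-check path.
    import json, logging, os, sys
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    log = logging.getLogger("app")

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body, status = (b"ok", 200) if self.path == "/healthz" else (b"hello", 200)
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            # One structured log line per request, so the platform's log-based
            # alerting has something to key off without extra agents.
            log.info(json.dumps({"path": self.path, "status": status}))

        def log_message(self, *args):  # silence the default stderr access log
            pass

    if __name__ == "__main__":
        port = int(os.environ.get("PORT", "8080"))
        ThreadingHTTPServer(("0.0.0.0", port), Handler).serve_forever()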

Then you have the large shops: the ones running at a scale where the cost premium of container platforms is higher than the salary of an engineer to move you off them; the ones where you have to figure out how to get the systems of pre-M&A companies to talk to each other; where N development teams are organizationally far away from the sales and legal teams signing SLAs, yet still need to be constrained by said SLAs; where some system was architected to handle X scale, the business has now sold 100X, and you have to figure out which band-aids to throw at the failing system while telling the devs they need to re-architect; where you need to build your Alertmanager routing tree configuration dynamically, because YAML is garbage and the routing rules change based on whether or not SRE decided to return the pager, plus devs need the ability to self-service the creation of new services, plus new alerts have to roll out progressively across the organization, etc., so even the Alertmanager config needs to be owned by an engineer.
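
To make the Alertmanager point concrete, here's a rough sketch of "generate the routing tree instead of hand-writing it" (Python; the team names, the sre_owns_pager flags, and the receiver names are made up, and it assumes PyYAML):

    # Rough sketch: generate alertmanager.yml instead of hand-writing YAML.
    # The team list, the sre_owns_pager flags, and the receiver names are
    # invented; the emitted structure follows Alertmanager's route/receivers
    # schema.
    import yaml

    def build_config(teams, sre_owns_pager):
        routes = []
        receivers = [{"name": "blackhole"}, {"name": "sre-pagerduty"}]
        for team in teams:
            # A real config would attach pagerduty_configs/slack_configs here.
            receivers.append({"name": f"{team}-pagerduty"})
            routes.append({
                "matchers": [f'team = "{team}"'],
                # This is the "did SRE return the pager?" branch from above.
                "receiver": "sre-pagerduty" if sre_owns_pager[team] else f"{team}-pagerduty",
            })
        return {
            "route": {"receiver": "blackhole", "group_by": ["alertname", "team"], "routes": routes},
            "receivers": receivers,
        }

    if __name__ == "__main__":
        cfg = build_config(["checkout", "payments"], {"checkout": True, "payments": False})
        print(yaml.safe_dump(cfg, sort_keys=False))

The point isn't the few lines of Python; it's that the routing tree becomes something an engineer owns, can test, and can roll out progressively.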

I really can't imagine LLMs replacing SREs in large shops. SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

> SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

According to the stated goals of SRE, this is actually not just a small fraction of the job - it's something that shouldn't happen at all. To be clear, I'm fully aware that it will always be necessary, but whenever it does happen, it's because the site reliability engineer (SRE) overlooked something.

Hence, if that's considered a large part of the job... then you're just not an SRE as Google defined the role.

https://sre.google/sre-book/table-of-contents/

There's very little connection to the blog post we're commenting on, though - at least as far as I can tell.

At least I didn't find any focus on debugging. It put forward that the capability to produce reliable software is what will distinguish companies in the future, and I think that holds up and is in line with the official definition of SRE.

Having worked on Cloud Run/Cloud Functions, I think almost every company that isn't itself a cloud provider could be in category 1, given moderately more featureful implementations that actually competed with K8s.

Kubernetes is a huge problem. IMO it's a shitty prototype that the industry ran away with (because Google tried to throw a wrench at Docker/AWS when containers and cloud were the hot new things, by pretending Kubernetes is basically the same as Borg). Then the community calcified around that prototype state and bought all this SaaS / structured their production environments around it, and now all these SaaS providers and Platform Engineers/DevOps people who make a living off of milking money out of Kubernetes users are guarding their gold mines.

Part of the K8s marketing push was rebranding Infrastructure Engineering as building atop Kubernetes (vs operating at the layers at and beneath it), and K8s leaks abstractions/exposes an enormous configuration surface area, so you just get K8s But More Configuration/Leaks. Also, You Need A Platform, so do Platform Engineering too, for your totally unique use case of connecting git to CI to a slackbot/email/2FA to your release scripts.

At my new company we're working on fixing this, but it'll probably be 1-2 more years until we can open source it (mostly because it's not generalized enough yet and I don't want to make the same mistake as Kubernetes, but we will open source it). The problem is mostly multitenancy, better primitives, modeling the whole user story in the platform itself, and getting rid of false dichotomies/bad abstractions around scaling and state (including the entire control plane). Also, more official tooling, and you have to put on a dunce cap if YAML gets within 2 network hops of any zone.

In your example, I think

1. you shouldn't have to think about scale and provisioning at the service/cluster level; it should always be handled at the multitenant zonal level. This is one of the cardinal sins Kubernetes committed that Borg handled much better

2. YAML is indeed garbage, but availability reporting and alerting need better official support; it doesn't make sense for every ecommerce shop and bank to be building this stuff

3. a huge number of alerts and configs could actually be expressed as business logic if cloud platforms exposed synchronous/real-time billing with the scaling speed of Cloud Run (see the sketch after this list).
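
A toy sketch of what such a business-logic alert could look like (Python; the snapshot fields, the numbers, and the 15% threshold are all invented):

    # Toy sketch of an alert written as business logic instead of a resource
    # threshold. BusinessSnapshot, its numbers, and the 15% limit are invented;
    # spend_rate_per_hour stands in for a (hypothetical) real-time billing API.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class BusinessSnapshot:
        spend_rate_per_hour: float   # from real-time billing (assumed to exist)
        orders_per_hour: float       # from the shop's own metrics
        margin_per_order: float      # from finance

    def infra_cost_alert(s: BusinessSnapshot, max_cost_fraction: float = 0.15) -> Optional[str]:
        """Page when infra cost eats too much of gross margin, not when CPU crosses 80%."""
        gross_margin_per_hour = s.orders_per_hour * s.margin_per_order
        if gross_margin_per_hour == 0:
            return "paying for infra while selling nothing"
        fraction = s.spend_rate_per_hour / gross_margin_per_hour
        if fraction > max_cost_fraction:
            return f"infra is {fraction:.0%} of gross margin (limit {max_cost_fraction:.0%})"
        return None

    if __name__ == "__main__":
        # 120/hr of infra against 400 orders * 1.50 margin -> fires at 20%.
        print(infra_cost_alert(BusinessSnapshot(120.0, 400.0, 1.5)))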

If you think about it, so, so, so many of the problems DevOps teams deal with are literally just:

1. We need to be able to handle scaling events

2. We need to control costs

3. Sometimes these conflict and we struggle to translate between the two.

4. The company renting me VMs/running K8s for me won't tell me how much I've spent, or set hard billing limits that get enforced in real time. (I implemented enforcement for something close to this for Run/App Engine/Functions; it truly is a very difficult problem, but I do think it's possible. Real-time usage -> billing -> balance debits was one of the first things we implemented on our platform; there's a sketch of that loop after this list.)

5. For some reason scaling and provisioning are different things (partly because the cloud provider is slow, partly because Kubernetes is single-tenant)

6. Our ops team's job is to translate between business logic and resource logic, and half our alerts are basically asking a human to manually make some cost/scaling analysis or tradeoff, because we can't automate that, because the underlying resource model/platform makes it impossible.
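
Here's a toy sketch of the real-time usage -> billing -> balance-debit loop from point 4 (Python; the price, the Account shape, and the admission check are invented, not any provider's actual API):

    # Toy sketch: meter usage, debit a prepaid balance synchronously, and
    # enforce a hard limit by refusing new work instead of mailing a surprise
    # invoice at month-end. Everything here is illustrative.
    import threading

    PRICE_PER_CPU_SECOND = 0.00003   # made-up unit price

    class Account:
        def __init__(self, balance: float):
            self.balance = balance
            self._lock = threading.Lock()

        def debit_usage(self, cpu_seconds: float) -> bool:
            """Apply a usage-based debit; False means the hard limit was hit."""
            cost = cpu_seconds * PRICE_PER_CPU_SECOND
            with self._lock:
                if self.balance < cost:
                    return False      # enforcement point: stop admitting work
                self.balance -= cost
                return True

    def admit_request(acct: Account, estimated_cpu_seconds: float) -> bool:
        # Scheduler-side check: billing and scaling share one ledger, so "can we
        # afford this?" is answered at admission time, not at month-end.
        return acct.debit_usage(estimated_cpu_seconds)

    if __name__ == "__main__":
        acct = Account(balance=0.01)
        served = sum(admit_request(acct, 5.0) for _ in range(100))
        print(f"served {served} requests before the hard limit kicked in")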

You gotta go under the hood to fix this stuff.

  • Every time I’ve pushed for Cloud Run at jobs that were on k8s, or leaning towards it, I was looked at as a very unserious person. Like you can’t be a “real” engineer if you’re not battling YAML configs and ArgoCD all day (and all night).