Comment by solatic

12 days ago

I think there are two kinds of software-producing organizations:

There are the small shops where you're running some kind of monolith generally open to the Internet, maybe with a database hooked up to it. These shops do not need dedicated DevOps/SRE. Throw it into a container platform (e.g., AWS ECS/Fargate, GCP Cloud Run, fly.io; the market is broad enough that it's basically getting commoditized), hook up observability/alerting, and maybe pay a consultant to review it and make sure you didn't do anything stupid. Then just pay the bill every month, and don't over-think it.

Then you have large shops: the ones running at a scale where the cost premium of container platforms is higher than the salary of an engineer to move you off them; the ones where you have to figure out how to get the systems of different pre-M&A companies to talk to each other; where you have N development teams organizationally far away from the sales and legal teams signing SLAs, yet those teams still need to be constrained by said SLAs; where some system was architected to handle X scale, the business has now sold 100X, and you have to figure out what band-aids to throw at the failing system while telling the devs they need to re-architect; where you need to build your Alertmanager routing tree configuration dynamically, because YAML is garbage and the routing rules change based on whether or not SRE decided to return the pager, plus devs need to be able to self-service new services, plus new alerts need progressive rollout across the organization, etc. So even Alertmanager config needs to be owned by an engineer.
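
To make "build the routing tree dynamically" concrete, here's a rough sketch of the shape of it: team metadata lives in one place (a plain dict standing in for a service catalog) and the Alertmanager config is rendered from it instead of being hand-edited. Team names, receiver names, and the pager flag are all made up for the example.

  import yaml  # PyYAML

  # Hypothetical service-catalog extract: which teams currently hold their own pager.
  TEAMS = {
      "payments": {"pager_returned_to_devs": True},
      "search":   {"pager_returned_to_devs": False},
  }

  def render_routing_tree(teams):
      receivers = [{"name": "sre-pagerduty"}]  # credentials omitted
      routes = []
      for name, meta in teams.items():
          # the routing rule flips depending on who currently owns the pager
          receiver = f"{name}-pagerduty" if meta["pager_returned_to_devs"] else "sre-pagerduty"
          if receiver != "sre-pagerduty":
              receivers.append({"name": receiver})
          routes.append({"receiver": receiver, "matchers": [f'team="{name}"']})
      return {"route": {"receiver": "sre-pagerduty", "routes": routes},
              "receivers": receivers}

  print(yaml.safe_dump(render_routing_tree(TEAMS), sort_keys=False))

The config becomes an artifact of a build step, so "SRE returned the pager to team X" is a one-line data change instead of YAML surgery, and the same generator can enforce whatever self-service and rollout rules you need.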

I really can't imagine LLMs replacing SREs in large shops. SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

> SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

According to the stated goals of SRE, this is actually not just a small fraction; it's something that shouldn't happen at all. To be clear, I'm fully aware that it will always be necessary, but whenever it happens, it's because the site reliability engineer (SRE) overlooked something.

Hence, if that's considered a large part of the job... then you're just not an SRE as Google defined that role.

https://sre.google/sre-book/table-of-contents/

There's very little connection to the blog post we're commenting on, though, at least as far as I can tell.

At least I didn't find any focus on debugging. It put forward that the capability to produce reliable software is what will set companies apart in the future, and I think that holds up and is in line with the official definition of SRE.

  • I don't think people really adhere to Google's definition; most companies aren't even at remotely similar scale. Most SREs I've seen are running from one PagerDuty alert to the next and not really doing much of a deep dive into understanding the problem.

  • This makes sense. As an analogy, the flight crash investigator is presumably a very different role from the engineer designing flight safety systems.

    • I think you've identified analogous functions, but I don't think your analogy holds as you've written it. A more faithful analogy to the OP is that there is no better flight crash investigator than the aviation engineer who designed the plane, but having to investigate a crash is a failure of their primary duty of engineering safe planes.

      Still not a great rendition of this thought, but closer.

Those Alertmanager descriptions feel scary. I'm stuck in the Zabbix era.

What do you mean by "progressive rollout of new alerts across the organization"? What kind of alerts?

  • Well, all kinds. Alerting is a really great way to track things that need to change, tell people about them through established channels, and also tell them when a thing has been addressed satisfactorily. Alertmanager will already be configured with credentials and network access to PagerDuty, Slack, Jira, email, etc., and you can use something like Karma to give people interfaces to the different Alertmanagers and manage silences.

    If you're deploying alerts, then yeah you want a progressive rollout just like anything else, or you run the risk of alert fatigue from false positives, which is Really Bad because it undermines faith in the alerting system.

    For example, say you want to start tracking, per team, how many code quality issues they have, and set thresholds above which they get alerted. The alert makes a Jira ticket; getting code quality under control is the kind of thing that can wait to be scheduled into a sprint. You probably need different alert thresholds for different teams, and you want to test the waters before you start having Alertmanager create real Jira issues. So, yeah, progressive rollout.
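
    Rough sketch of what that can look like, with made-up metric names, teams, and thresholds: the per-team threshold and the rollout stage are just data, and the "shadow" stage carries a label the routing tree can send to a log-only receiver instead of Jira.

      import yaml  # PyYAML

      # Hypothetical per-team settings; "shadow" means don't file real Jira issues yet.
      TEAM_THRESHOLDS = {
          "payments": {"max_issues": 50,  "stage": "enforce"},
          "search":   {"max_issues": 200, "stage": "shadow"},
      }

      def render_rules(thresholds):
          rules = []
          for team, cfg in thresholds.items():
              rules.append({
                  "alert": "TooManyCodeQualityIssues",
                  # the metric name is invented for the example
                  "expr": f'code_quality_issues{{team="{team}"}} > {cfg["max_issues"]}',
                  "for": "1d",
                  "labels": {"team": team, "severity": "ticket", "rollout": cfg["stage"]},
                  "annotations": {"summary": f"{team} is over its code-quality budget"},
              })
          return {"groups": [{"name": "code-quality", "rules": rules}]}

      print(yaml.safe_dump(render_rules(TEAM_THRESHOLDS), sort_keys=False))

    Promoting a team from "shadow" to "enforce" is then a one-line data change, which is what makes the progressive part manageable across a lot of teams.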

Having worked on Cloud Run/Cloud Functions, I think almost every company that isn't itself a cloud provider could be in category 1, if those platforms had moderately more featureful implementations that actually competed with K8s.

Kubernetes is a huge problem. IMO it's a shitty prototype that the industry ran away with (because Google tried to throw a wrench at Docker/AWS when containers and cloud were the hot new things, pretending Kubernetes is basically the same as Borg). Then the community calcified around that prototype state, bought all this SaaS, and structured their production environments around it, and now all these SaaS providers and Platform Engineers/DevOps people who make a living off of milking money out of Kubernetes users are guarding their gold mines.

Part of the K8s marketing push was rebranding Infrastructure Engineering as building atop Kubernetes (vs. operating at the layers at and beneath it), and K8s leaks abstractions and exposes an enormous configuration surface area, so you just get K8s But More Configuration/Leaks. Also, You Need A Platform, so do Platform Engineering too, for your totally unique use case of connecting git to CI to a Slack bot/email/2FA to your release scripts.

At my new company we're working on fixing this, but it'll probably be 1-2 more years until we can open source it (mostly because it's not generalized enough yet and I don't want to make the same mistake as Kubernetes. But we will open source it). The problem is mostly multitenancy, better primitives, modeling the whole user story in the platform itself, and getting rid of false dichotomies/bad abstractions regarding scaling and state (including the entire control plane). Also, more official tooling, and you have to put on a dunce cap if YAML gets within 2 network hops of any zone.

In your example, I think

1. You shouldn't have to think about scaling and provisioning at this level of granularity; it should always be at the multitenant zonal level. This is one of the cardinal sins Kubernetes committed that Borg handled much better.

2. YAML is indeed garbage, but availability reporting and alerting need better official support; it doesn't make sense for every ecommerce shop and bank to be building this stuff.

3. A huge number of alerts and configs could actually be expressed in business logic if cloud platforms exposed synchronous/real-time billing with the scaling speed of Cloud Run.
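
To make point 3 concrete, here's a toy sketch of the kind of alert that collapses into plain code once spend is queryable synchronously; the budget, price, and function are all made up, and the real spend number would come from that hypothetical real-time billing API.

  DAILY_BUDGET_USD = 500.0
  COST_PER_REPLICA_HOUR = 0.12  # made-up price

  def replicas_allowed(spend_today: float, desired: int, hours_left: float) -> int:
      """Cap a scale-up so projected spend stays inside the daily budget,
      instead of paging a human to eyeball the cost/scale tradeoff."""
      headroom = DAILY_BUDGET_USD - spend_today
      if headroom <= 0:
          return 0
      affordable = int(headroom / (COST_PER_REPLICA_HOUR * hours_left))
      return min(desired, affordable)

  # the autoscaler hook answers the question an alert used to ask a human
  print(replicas_allowed(spend_today=430.0, desired=40, hours_left=6.0))  # 40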

If you think about it, so, so many of the problems DevOps teams deal with are literally just:

1. We need to be able to handle scaling events

2. We need to control costs

3. Sometimes these conflict and we struggle to translate between the two.

4. Nobody lets me set hard billing limits/enforcement at the platform level.

(I implemented enforcement for something close to this for Run/App Engine/Functions; it truly is a very difficult problem, but I do think it's possible. Real-time usage -> billing -> balance debits was one of the first things we implemented on our platform.)

5. For some reason scaling and provisioning are different things (partly because the cloud provider is slow, partly because Kubernetes is single-tenant).

6. Our ops team's job is to translate between business logic and resource logic, and half our alerts are basically asking a human to manually make some cost/scaling analysis or tradeoff, because we can't automate that, because the underlying resource model/platform makes it impossible.

You gotta go under the hood to fix this stuff.

  • Since you are developing in this domain: our challenge with both Lambdas and Cloud Run-type managed solutions is that they seem incompatible with our service mesh. Cloud Run and Lambdas cannot be incorporated with the GCP service mesh, but only if it is managed through GCP as well. Anything custom is out of the question. Since we require end-to-end mTLS in our setup, we cannot use Cloud Run.

    To me this shows that Cloud Run is more of an end product than a building block, and that hinders adoption: we basically need to replicate most of Cloud Run ourselves just to add that tiny bit of also running our sidecar.

    How do you see this going in your new solution?

    • > Cloud Run and Lambdas cannot be incorporated with the GCP service mesh, but only if it is managed through GCP as well

      I'm not exactly sure what this means; a few different interpretations make sense to me. If this is purely a Run <-> other-GCP-product-in-a-VPC problem, I'm not sure how much info about that is considered proprietary and which parts I could share, or even whether my understanding of it is still accurate. If it's that Cloud Run can't run in your service mesh, then it's just that these are both managed services. But yes, I do think it's possible to run into a situation/configuration that is impossible to express in Run yet doesn't seem like it should be inexpressible.

      This is why designing around multitenancy is important. I think with hierarchical namespacing and a transparent resource model you could offer better escape hatches for integrating managed services/products that don't know how to talk to each other. Even though your project may be a single "tenant", because these managed services are probably implemented in different ways under the hood and have opaque resource models (i.e., Run doesn't fully expose all underlying primitives), they end up basically being multitenant relative to each other.

      That being said, I don't see why you couldn't use mTLS to talk to Cloud Run instances; you just might have to implement it differently from how you're doing it elsewhere? This almost just sounds like a shortcoming of your service mesh implementation, in that it doesn't bundle something exposing Run-like semantics by default (which is basically what we're doing), because why would it know how to talk to a proprietary third-party managed service?

  • There are plenty of PaaS components that run on k8s if you want to use them. I'm not a fan, because I think giving developers direct access to k8s is the better pattern.

    Managed k8s services like EKS have been super reliable the last few years.

    YAML is fine; it's just a configuration language.

    > you shouldn't have to think about scaling and provisioning at this level of granularity, it should always be at the multitenant zonal level, this is one of the cardinal sins Kubernetes made that Borg handled much better

    I'm not sure what you mean here. Managed k8s services, and even k8s clusters you deploy yourself, can autoscale across AZs. This has been a feature for many years now. You just set a topology key on your pod template spec and your pods will spread across the AZs. Easy.
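
    For example, one way to express that, written as the Python dict you'd render to YAML or hand to the Kubernetes client (the app label is made up):

      # topologySpreadConstraints on the pod template spec spreads replicas
      # across zones; topology.kubernetes.io/zone is the standard node label.
      pod_template_spec_fragment = {
          "topologySpreadConstraints": [{
              "maxSkew": 1,
              "topologyKey": "topology.kubernetes.io/zone",
              "whenUnsatisfiable": "ScheduleAnyway",
              "labelSelector": {"matchLabels": {"app": "checkout"}},
          }]
      }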

    For most tasks you'd want to do to deploy an application, there's an out-of-the-box solution for k8s that already exists. Millions of labor-hours have been poured into k8s as a platform; unless you have some extremely niche use case, you are wasting your time building an alternative.

  • Lots to unpack here.

    I will just say, based on recent experience, that the fix is not "Kubernetes bad"; it's that Kubernetes is not a product platform. It's a substrate, and most orgs actually want a platform.

    We recently ripped out a barebones Kubernetes product (like Rancher, but not Rancher). It was hosting a lot of our software development apps, like GitLab, Nexus, Keycloak, etc.

    But in order to run those things, you have to build an entire platform and wire it all together. This is on premises running on vxRail.

    We ended up discovering that our company had an internal software development platform based on EKS-A; it comes with auto-installers for all the apps and includes ArgoCD to maintain state and orchestrate new deployments.

    The previous team did a shitty job DIY-ing the prior platform. So we switched to something more maintainable.

    If someone made a product like that then I am sure a lot of people would buy it.

  • > real-time usage -> billing

    This is one of the things that excites me about TigerBeetle; the reason so much cloud-provider billing is reported at hourly granularity at best is that the underlying systems are running batch jobs to calculate final billed sums. Having a billing database that is efficient enough to keep up in real time is a game-changer, and we've barely scratched the surface of what it makes possible.

    • Thanks for mentioning them. We're doing quite similar debit-credit stuff to https://docs.tigerbeetle.com/concepts/debit-credit/ but, reading https://docs.tigerbeetle.com/concepts/performance/, they are definitely thinking about the problem differently from us. For a cloud platform you need much more prescribed entities (e.g., resources and SKUs) on the modelling side and different choices on the performance side (for something like a usage pricing system).

      This feels like a single-tenant, centralized ACH, but I think what you actually want for a multitenant, multizonal cloud platform is not ACH but something more capability-based. The problem is that cloud resources are billed as subscriptions/rates, and you can't centralize anything on the hot path (like this does), because then any availability problem in the zone or node everything routes through causes a lack of availability for everything else. Also, the business logic for computing an actual final bill for a cloud customer's usage is quite complex because it's reliant on so many different kinds of things, including pricing models that can get very complex or bespoke, and it doesn't seem like TigerBeetle wants calculating prices to be part of their transactions (I think).

      The way we're modelling this is with hierarchical sub-ledgers (e.g., per-zone, per-tenant, per-resource-group) and something you could think of as a line of credit. In my opinion the pricing and resource modelling, plus their integration with the billing transaction, are much more challenging because they need to be able to handle a lot of business logic. Anyway, if someone chooses to opt in to invoice billing, there's an escape hatch and a way for us to handle things we can't express yet.
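
      As a toy sketch of the sub-ledger idea (the names, limits, and rollup behaviour are illustrative, not our actual implementation): usage is debited at the most specific ledger, rolls up to its parents, and a debit only fails once some level's credit line is exhausted.

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class Ledger:
            name: str
            credit_limit: float = float("inf")  # the "line of credit" at this level
            balance: float = 0.0                # accumulated usage debits
            parent: Optional["Ledger"] = None

            def debit(self, amount: float) -> bool:
                # every ancestor needs headroom before the debit is applied anywhere
                node = self
                while node:
                    if node.balance + amount > node.credit_limit:
                        return False
                    node = node.parent
                node = self
                while node:
                    node.balance += amount
                    node = node.parent
                return True

        tenant = Ledger("tenant-a", credit_limit=1_000.0)
        zone   = Ledger("tenant-a/us-east1", parent=tenant)
        rg     = Ledger("tenant-a/us-east1/web", credit_limit=200.0, parent=zone)

        print(rg.debit(150.0))  # True:  fits the resource-group and tenant lines
        print(rg.debit(100.0))  # False: would blow the resource group's 200.0 line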

  • Every time I've pushed for Cloud Run at jobs that were on or leaning towards k8s, I was looked at as a very unserious person. Like you can't be a "real" engineer if you're not battling YAML configs and ArgoCD all day (and all night).

    • It does have real tradeoffs/flaws/limitations, chief among them that Run isn't allowed to "become" Kubernetes; you're expected to "graduate". There's been an immense marketing push for Kubernetes and Platform Engineering and all the associated SaaS sending the same message (also, notice how much less praise you hear about it now that the marketing has died down?).

      The incentives are just really messed up all around. Think about all the actual people working in DevOps who have their careers/jobs tied to Kubernetes, how many developers get drawn in by the allure and marketing because it lets them work on more fun problems than their actual job, all the provisioned instances and vendor software and certs and conferences, and all the money that represents.