Comment by faizshah

4 years ago

There’s overhead in both of these models. When you try to scale centralized infrastructure across a large number of services and teams, you run into many scaling problems that your central infra teams will need to solve. In addition, this model gives your service teams less flexibility and adds centralized failure scenarios that may be undesirable at the scale of such a company.

Instead, you can split ownership: central infrastructure and platform teams own part of the infrastructure, and service teams or departments own the rest. In that model, a service team or department owns its own cluster, with infrastructure code partially provided by the platform team, running on compute infra maintained by the infrastructure team.

This model has seen much success at the scale of Amazon while maintaining SLAs and controlling costs. There are, of course, a number of drawbacks to this model at scale that you can ask any former Amazon engineer about.

I would think a similar model can be mapped to hyperscaling Kubernetes, where operators, cross-cluster infra, and base Kubernetes configs are maintained by the platform team, while departments or service teams (depending on team size) maintain their own clusters at whatever granularity fits the company’s scale (e.g. region, datacenter).
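To make that split concrete, here is a rough sketch of how it could look in code. Everything in it is made up for illustration: the config keys, the values, the `PLATFORM_OWNED` set and the `build_cluster_config` helper are assumptions, not a real Kubernetes or Amazon API. The platform team ships a baseline, and each service team layers its own settings on top without being able to touch the platform-owned parts.

```python
from copy import deepcopy

# Hypothetical baseline maintained by the platform team (illustrative values).
PLATFORM_BASE = {
    "kubernetes_version": "1.29",
    "operators": ["cert-manager", "ingress-controller"],
    "network_policy": "default-deny",
    "node_pool": {"instance_type": "m5.large", "min_nodes": 3},
}

# Keys the platform team keeps control of; service teams may not override them.
PLATFORM_OWNED = {"kubernetes_version", "operators", "network_policy"}


def build_cluster_config(team_overrides):
    """Layer a service team's overrides on top of the platform baseline."""
    blocked = PLATFORM_OWNED.intersection(team_overrides)
    if blocked:
        raise ValueError(
            f"platform-owned keys cannot be overridden: {sorted(blocked)}"
        )
    config = deepcopy(PLATFORM_BASE)  # never mutate the shared baseline
    for key, value in team_overrides.items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key].update(value)  # merge one nested level
        else:
            config[key] = value
    return config


# A team-owned cluster at region granularity (hypothetical team and region).
payments_eu = build_cluster_config(
    {"region": "eu-west-1", "node_pool": {"min_nodes": 6}}
)
```

The team gets to size and place its own cluster, while an attempt to override something like `network_policy` fails fast instead of silently drifting from the platform baseline.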

This is also where cloud can help: managed solutions can take some of these problems off your platform and infra teams.

In my opinion, both are viable models at scale; it just depends on the needs specific to your company.