
Comment by faizshah

4 years ago

Maybe I'm misunderstanding, but why would you want to run one giant Kubernetes cluster at hyper scale? I would think at hyper scale you would be running microservices with one Kubernetes cluster per service per datacenter/region.

At hyper scale, efficiency matters a lot: small gains multiplied by big numbers become big gains. The overhead of running a cluster is not small, but more importantly, the overhead of managing nodes between services is real.

No, you would not want to run a cluster per service; Kubernetes is somewhat multitenant. You'd get better efficiency from running one cluster per datacenter, with all those teams in their own namespaces. Kubernetes can stack many containers per physical host, with widely varied workloads cooperatively sharing.
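
As a rough illustration of that layout (just a sketch with client-go, not anyone's production setup; the team name and quota numbers are made up): one shared cluster per datacenter, one namespace per team, and a ResourceQuota so tenants can stack onto the same nodes without starving each other.

```go
// namespace_per_team.go: carve a shared cluster into per-team namespaces
// with resource quotas. The team name and limits are illustrative only.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	team := "team-payments" // hypothetical tenant

	// One namespace per team in the shared, per-datacenter cluster.
	ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: team}}
	if _, err := client.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// A quota keeps tenants from starving each other while still letting
	// the scheduler stack their pods onto shared nodes.
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "team-quota", Namespace: team},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourceRequestsCPU:    resource.MustParse("200"),
				corev1.ResourceRequestsMemory: resource.MustParse("800Gi"),
				corev1.ResourcePods:           resource.MustParse("500"),
			},
		},
	}
	if _, err := client.CoreV1().ResourceQuotas(team).Create(ctx, quota, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	fmt.Printf("namespace %q created with quota\n", team)
}
```

RBAC, NetworkPolicies, and priority classes layer onto the same namespace boundary, which is what makes the "teams in their own namespaces" model workable.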

  • There's overhead in both of these models. When you try to scale centralized infrastructure across a large number of services and teams, you run into many scaling problems that your central infra teams will need to solve. In addition, this model gives less flexibility to your service teams and adds centralized failure scenarios that might be undesirable at the scale of such a company.

    Instead, you can have central infrastructure and platform teams share ownership with the service teams or departments. In that model a service team or department owns its own cluster, with infrastructure code partially provided by the platform team, running on compute infra maintained by the infrastructure team.

    This model has seen a lot of success at Amazon's scale while maintaining SLAs and controlling costs. There are, of course, a number of drawbacks to it at scale that you can ask any former Amazon engineer about.

    I would think a similar model can be mapped to hyperscaling Kubernetes, where operators, cross-cluster infra, and base Kubernetes configs are maintained by the platform team, while departments or service teams (depending on team size) maintain their own clusters at whatever granularity fits the company's scale (e.g. region, datacenter). A rough sketch of that split follows at the end of this comment.

    This is also where cloud can help alleviate some pain for your platform and infra teams by using managed solutions to solve some of these problems.

    In my opinion both are viable models at scale; it just depends on the needs specific to your company.
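
    To make that split concrete, here is a hedged sketch (not how Amazon or anyone else actually does it): the platform team owns a baseline object and pushes it into every team-owned cluster, each reachable through its own kubeconfig. The kubeconfig paths and the PriorityClass are invented for the example.

```go
// platform_baseline.go: the platform team owns a baseline config and pushes
// it to every team-owned cluster; teams own everything else in their cluster.
// Kubeconfig paths and the PriorityClass are illustrative only.
package main

import (
	"context"
	"fmt"

	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// One kubeconfig per team- or department-owned cluster (hypothetical paths).
	teamClusters := []string{
		"/etc/platform/kubeconfigs/payments-us-east.yaml",
		"/etc/platform/kubeconfigs/search-eu-west.yaml",
	}

	// A platform-provided baseline: for example, a shared PriorityClass so
	// every cluster agrees on what "critical" means.
	baseline := &schedulingv1.PriorityClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "platform-critical"},
		Value:       1000000,
		Description: "reserved for platform-blessed critical services",
	}

	ctx := context.Background()
	for _, path := range teamClusters {
		cfg, err := clientcmd.BuildConfigFromFlags("", path)
		if err != nil {
			panic(err)
		}
		client := kubernetes.NewForConfigOrDie(cfg)
		if _, err := client.SchedulingV1().PriorityClasses().Create(ctx, baseline, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		fmt.Printf("baseline applied to %s\n", path)
	}
}
```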

  • > Kubernetes can stack many containers per physical host, with widely varied workloads cooperatively sharing.

    Until you learn the downsides of this approach at hyper scale, namely that containers and a shared kernel mean all of your workloads share the same kernel parameters, including things like network limits, timeouts, and file handle limits. Multitenancy and containers actually end up working against you, creating new problems and knobs in your individual jobs that you have to configure, to the point that it's almost worth running different types of jobs on different isolated node pools and eliminating your multitenancy issue anyway.

    Companies that scaled on KVM never had to learn about these limitations and just focused on what their hardware was capable of in aggregate.

    At hyper scale and with multitenancy, microVMs are always going to be the end state, and while there's k8s support for this, it's far from the default or even the most convenient option.
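
    For what it's worth, both escape hatches mentioned here do exist in Kubernetes, though neither is the default: dedicated node pools via taints/tolerations, and microVM isolation via a RuntimeClass. A hedged sketch, assuming the cluster admin has already labeled and tainted a node pool with dedicated=ingest and registered a kata RuntimeClass (both names invented):

```go
// isolated_pod.go: opt a job out of multitenancy by pinning it to a
// dedicated, tainted node pool and running it under a microVM runtime.
// The "dedicated=ingest" taint/label and the "kata" RuntimeClass are
// assumed to exist already; all names are illustrative.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	runtimeClass := "kata" // hypothetical microVM runtime registered on the nodes

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "ingest-worker", Namespace: "default"},
		Spec: corev1.PodSpec{
			// Only land on the dedicated pool...
			NodeSelector: map[string]string{"dedicated": "ingest"},
			// ...and tolerate the taint that keeps everyone else off it.
			Tolerations: []corev1.Toleration{{
				Key:      "dedicated",
				Operator: corev1.TolerationOpEqual,
				Value:    "ingest",
				Effect:   corev1.TaintEffectNoSchedule,
			}},
			// Run under a microVM runtime instead of the shared host kernel.
			RuntimeClassName: &runtimeClass,
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "registry.example.com/ingest-worker:latest", // placeholder image
			}},
		},
	}

	if _, err := client.CoreV1().Pods("default").Create(context.Background(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```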

    • Network limits and timeouts aren't different between Kubernetes hosts and non-Kubernetes hosts. Network capacity is a real resource, and you may need to implement quality of service or custom resources (a new feature [1], and one that is late to the party).

      File handle limits are something no sane workload ever encounters. They are technically a shared resource, but in a sensible Kubernetes configuration they are impossible to hit, because the ulimits on each process are set low enough. A very small number of teams may need an exception, with good reason, and those will typically be cordoned onto their own, specially tainted node classes.

      Yes, fleet management via taints offers nothing over the fleet management you've already got. This is a good thing. Fleet management tools are damaging to your reliability: they mean that your machines are non-fungible. Kubernetes' great innovation is making machines, as units of compute, fungible.

      There are workloads and architectures that will never be suitable for Kubernetes. There are HPC clusters that rely heavily on things like rack locality, which Kubernetes views as damage. Getting rid of them is a net win for humanity.

      [1] https://kubernetes.io/docs/concepts/configuration/manage-res...
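
      To make the custom-resources point concrete: nodes can advertise an extended resource and pods can request it, so something like network capacity can at least be rationed by the scheduler even though it isn't tracked natively. A hedged sketch; the example.com/egress-gbps name is made up, and a node would have to advertise it before this pod could schedule:

```go
// extended_resource.go: request a node-level extended resource so the
// scheduler rations something Kubernetes doesn't track natively.
// "example.com/egress-gbps" is invented; a node would have to advertise it
// (e.g. via a status patch or a device plugin) before this would schedule.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	bandwidth := corev1.ResourceName("example.com/egress-gbps") // hypothetical extended resource

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "chatty-service", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.example.com/chatty-service:latest", // placeholder
				Resources: corev1.ResourceRequirements{
					// Extended resources take whole integers and can't be
					// overcommitted, so requests must equal limits.
					Requests: corev1.ResourceList{bandwidth: resource.MustParse("2")},
					Limits:   corev1.ResourceList{bandwidth: resource.MustParse("2")},
				},
			}},
		},
	}

	if _, err := client.CoreV1().Pods("default").Create(context.Background(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```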

      3 replies →

    • At hyper scale you don't need to worry about sharing as much, because the important services are far bigger than one machine. That sidesteps the problem: you can apply whatever sysctls or configs you need before starting the container.

      "Multitenancy" here means "I have a giant pool of machines and I run a bunch of jobs across them" not "I have a pool of giant machines and I stack jobs on them".

You wouldn't want to run one giant cluster, but at hyper scale you're talking about running thousands or tens of thousands of kubernetes clusters. That's the part that doesn't scale well, for a couple of reasons.

The biggest one is just mechanical: with that many clusters it will be hard to move capacity between clusters, and locality gets baked into everything you do (people do try to build around this, but it's awkward).

If each service or team runs their own Kubernetes, that's a lot of overhead: Kubernetes needs something like 6-7 machines for the cluster itself (I don't have production experience with kube, spitballing here), so small teams or jobs will have terrible efficiency, and big teams will have to spend a lot of operational effort managing their fleets.

It's worth noting that at hyper scale there will be individual jobs in a datacenter that are bigger than Kubernetes handles comfortably. Handling those efficiently becomes very important; it's literally billions of dollars' worth of hardware.

> I would think at hyper scale you would be running micro services with one kubernetes cluster per service per datacenter/region.

While I loathe the term hyper scale... even a single AZ in a single region could be 50,000 machines. k8s is a dumpster fire at that size.

  • But you wouldn't run a single k8s cluster at that size. K8s isn't really multitenant.