Comment by busterarm

4 years ago

Managed Kubernetes isn't without its own overhead.

Let's take the most popular option, GKE. You're going to be on a release channel, and you need to understand that Google will upgrade your clusters and node pools frequently; it's your responsibility to keep your workloads ahead of deprecated features. If you pay attention to your emails you can keep up, but if there's ever a critical security issue in K8s, Google will force-migrate you to the latest patched version immediately. Even on the Stable channel. Even if some Beta features you depend on don't actually work in the new version (this actually happened while Workload Identity was still in Beta, because "you shouldn't be using Beta features in Production", even though that's exactly what Google's documentation and your TAM explicitly told you to do). Good luck!

Then there's the fact that Google doesn't give you any overt knobs to manage the memory of their workloads in the kube-system namespace. They spec these with a miserly amount of memory that will cause their metadata server and monitoring stack to CrashLoopBackOff if you have even a moderate number of jobs logging to stdout/stderr. For some of these you can add a nanny resource to expand the memory, but you have to reapply it every time the node pool gets upgraded or replaced, which can happen during any maintenance window (weekly, or at any time). For others you have to edit Google's own deployment and reapply your changes every time Google ships a new version. In other words, you need to monitor and manage Google's own workloads in your "managed" clusters.
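To give a flavor of what that looks like in practice, the kind of patch you end up re-applying is roughly this (the deployment name and values below are placeholders; check `kubectl -n kube-system get deploy` for whatever version your cluster is running):

```sh
# Sketch only: bump the memory limit on a kube-system workload that keeps
# OOMing. The deployment name is a placeholder.
kubectl -n kube-system patch deployment metrics-server-v0.5.2 \
  --type=json \
  -p='[{"op":"replace",
        "path":"/spec/template/spec/containers/0/resources/limits/memory",
        "value":"512Mi"}]'
# GKE's addon manager reconciles kube-system objects, so expect this to get
# reverted on node-pool upgrades/addon rollouts and to need reapplying.
```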

Setting up your own custom ingress to work within the parameters of their load balancers is non-trivial. If you use gRPC, you _will_ have to do this: GKE's ingress controller will not serve your needs.
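One piece of that puzzle, for anyone who hasn't been down this road: getting the Google L7 load balancer to speak HTTP/2 to your backends (whether that's your own ingress proxy or the gRPC services directly) via container-native load balancing. A minimal sketch with placeholder names and ports; you still need end-to-end TLS and health checks that a gRPC server can actually answer:

```sh
# Sketch: a Service annotated so the Google load balancer uses NEGs and
# HTTP/2 toward the backends. Names, selectors, and ports are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-svc
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/app-protocols: '{"grpc": "HTTP2"}'
spec:
  selector:
    app: my-grpc-app
  ports:
  - name: grpc          # must match the key used in app-protocols
    port: 443
    targetPort: 8443
EOF
```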

Setting up the VPC and required subnets properly is equally non-trivial, but luckily you only have to figure out the proper operating parameters once. Oh, and remember that Google Cloud networking is deny-by-default. Your developers will need to know how to expose their workloads with proper firewall/Cloud Armor rules, and unless they have prior cloud experience they likely won't have a clue what's required. Congrats, ongoing training of teams of engineers is now part of your role.
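For illustration, this is the sort of rule they end up needing to write (network and tag names are placeholders; the source ranges are Google's documented load balancer / health-check ranges):

```sh
# Sketch: explicit allow rule so load balancer traffic and health checks can
# reach the workload port, because VPC ingress is deny-by-default.
gcloud compute firewall-rules create allow-lb-to-grpc \
  --network=my-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8443 \
  --source-ranges=130.211.0.0/22,35.191.0.0/16 \
  --target-tags=gke-my-cluster-node
```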

Enjoy the misery of operating regional clusters and having all of your node pools suddenly become non-functional the minute one of your region's AZs goes down, which on average happens to at least one of any given cloud's regions twice per year. Hopefully not your region. You thought operating regional clusters would give you the HA to ride out exactly these situations, but with K8s on top it's the control plane itself that can't handle that kind of outage...

Oh, and if you have jobs that require changing a kernel parameter that GKE doesn't expose on node pools (because it isn't yet a "safe" sysctl in k8s), such as fs.file-max, you have no recourse.
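For context on why there's no recourse: the supported Kubernetes mechanism, pod-level sysctls via securityContext, only covers namespaced sysctls, and anything outside the "safe" set also has to be allowlisted on the kubelet, which isn't yours to configure. fs.file-max is node-level, so it's out of reach either way. A sketch of that pod-level mechanism, just to show where the line is (names, values, and image are illustrative):

```sh
# Sketch: pod-level sysctls cover only *namespaced* kernel parameters, and
# non-"safe" ones must additionally be allowlisted on the kubelet.
# Node-level sysctls like fs.file-max cannot be set this way at all.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-demo
spec:
  securityContext:
    sysctls:
    - name: net.core.somaxconn   # namespaced; may need kubelet allowlisting
      value: "4096"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
EOF
```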

There are numerous tickets open for these issues, and they have been open for years. Google isn't forthcoming with solutions, other than suggesting you scale up rather than out, even though K8s is advertised as a system that favors scaling out (although given the current deficiencies in K8s' software stack, scaling up really is the preferred option anyway).

Managed Kubernetes isn't a panacea if you have a reasonable scale of work that you're throwing at the clusters. You _will_ have to work with your cloud provider's support team to tune your clusters beyond the defaults and knobs they expose to regular everyday customers. You'd better have an Enterprise support account that you're paying $35,000-50,000 plus a percentage of spend per month for.

That's the scale that we're operating at.

> Managed Kubernetes isn't a panacea if you have a reasonable scale of work that you're throwing at the clusters.

And Nomad is?

I’ve seen Nomad setups, particularly Nomad-plus-Consul setups, fail spectacularly. One only has to search for the recent Roblox outage for a high-profile example.

Nomad requires just as much engineering talent to run as k8s; it's just packaged differently, and it has less community support and buy-in. Nomad plus Consul probably requires even MORE engineering talent to run. How many times have you had to troubleshoot etcd within a k8s cluster?

Come to think of it, as a practitioner of both, I can’t say k8s has ever failed me as spectacularly as Nomad has.

I say all of this as a fan of Nomad.

Moreover: managed Nomad is basically not even a thing, and if HashiCorp or some partner were to offer it, it would probably be ludicrously expensive (looking at Vault and Consul here).

  • Roblox's outage was down to the fact that they did two things you should explicitly not do when operating production Consul clusters. They even admitted as much.

    1) They were using Consul for multiple workloads -- as both their service discovery layer and as a high-performance KV store. These should have been isolated.

    2) They introduced high read _and_ write load without monitoring the underlying raft performance. Essentially they exceeded the operating parameters of the protocol/BoltDB (there's a sketch of what that monitoring looks like at the end of this comment).

    Well, and they turned on a new feature without proper testing. There's a difference between doing something stupid and the service being inherently flawed. Kubernetes has just as many footguns to shoot yourself with, if not more.

    As for your etcd comment: I know several people at different companies whose literal full-time job has devolved into tuning etcd for their k8s clusters.

    Ask anyone who works at Squarespace, or worked there 2-3 years ago. They had a full team of 30+ DevOps folks dedicated full-time to Kubernetes, and their production clusters were failing almost _daily_ for a full year.
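    Since I brought up raft monitoring above, here's a minimal sketch of what keeping an eye on it looks like using Consul's standard agent telemetry endpoint (the address is the default local agent; what counts as "too high" depends entirely on your setup):

    ```sh
    # Sketch: pull raft health samples from a local Consul agent.
    curl -s http://127.0.0.1:8500/v1/agent/metrics | \
      jq '.Samples[] | select(.Name | test("consul\\.raft\\.(commitTime|leader\\.lastContact)"))'
    # Sustained growth in commitTime / leader.lastContact means the leader or
    # the BoltDB-backed raft log is falling behind the write load -- the
    # failure mode described above.
    ```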

    • k8s and Nomad are both complex systems. My point is that Nomad isn’t some sort of magic bullet that cures the k8s illness. I can’t really speak to k8s 2-3 years ago because I wasn’t working with it. It has been a dream for me and my teams these days. But it still has its issues. Like anything else.

      Transitioning from k8s to Nomad for simplicity reasons doesn’t make a lick of sense. Nomad is going to fail in _at least_ the same ways, and there is going to be far less information out there on how to fix it.

Wow. Thanks for writing that. I've had friends preaching the miracles of k8s or managed k8s to me for years, but reading your post I'm glad I went with a rinky-dink systemd+debs setup for my current servers. You can't even adjust the FD limit?! Forced upgrades without notice that remove features in production, with a blame-the-user mentality?

  • Google's approach to all of their services is an upgrade treadmill. They take the opposite approach to AWS, who operate things like an actual service provider.

    Google will change APIs and deprecate things with little to no advance notice. Anywhere except Ads, really. The progress of code at Google is a sacred cow.

    You may have found some of their APIs barely in KTLO status -- the last time I used their Geocoding API, 5 or 6 years ago, not a single one of their own client libraries worked as documented. Specifically, the way you do auth was wildly different from the documentation, and seemed to be wildly different between client libraries as well. It looked like they'd gone through three or four rounds of iteration with that API and abandoned different client libraries at different stages. Extremely bizarre.

    If you can operate like that, then using Google's stuff is fine. You just have to be ready and willing to drop everything you're doing to fix things at any time because Google decided to break something that you rely on.