Comment by tjr225
4 years ago
> Managed Kubernetes isn't a panacea if you have a reasonable scale of work that you're throwing at the clusters.
And Nomad is?
I’ve seen Nomad setups, particularly Nomad plus Consul, fail spectacularly. One only has to search for the recent Roblox outage for a high-profile example.
Nomad requires just as much engineering talent to run as k8s; it is just packaged differently, and it has less community support and buy-in. Nomad plus Consul probably requires even MORE engineering talent to run. How many times have you had to troubleshoot etcd within a k8s cluster?
Come to think of it, as a practitioner of both, I can’t say k8s has ever failed me as spectacularly as Nomad has.
I say all of this as a fan of Nomad.
Moreover: managed Nomad is basically not even a thing, and if HashiCorp or some partnership were to offer it, it would probably be ludicrously expensive (looking at Vault and Consul here).
Roblox's outage happened because they did two things you should explicitly not do when operating production Consul clusters. They even admitted as much.
1) They were using Consul for multiple workloads, as both their service discovery layer and a high-performance KV store. These should have been isolated.
2) They introduced high read _and_ write load without monitoring the underlying Raft performance. Essentially they exceeded the operating parameters of the protocol/BoltDB.
Well, and they turned on a new feature without proper testing. There's a difference between doing something stupid and the services being inherently flawed. Kubernetes has at least as many footguns to shoot yourself with, if not more.
As for your etcd comment, I know several people at different companies whose literal full-time job has devolved into tuning etcd for their k8s clusters.
Ask anyone who worked at Squarespace 2-3 years ago. They had a full team of 30+ devops folks dedicated full-time to Kubernetes, and their production clusters were failing almost _daily_ for a full year.
k8s and Nomad are both complex systems. My point is that Nomad isn’t some sort of magic bullet that cures the k8s illness. I can’t really speak to k8s 2-3 years ago because I wasn’t working with it. It has been a dream for me and my teams these days. But it still has its issues. Like anything else.
Transitioning from k8s to Nomad for simplicity reasons doesn’t make a lick of sense. Nomad is going to fail in _at least_ the same ways, and there is going to be far less information out there on how to fix it.