Comment by busterarm

4 years ago

Roblox's outage happened specifically because they did two things you should explicitly not do when operating production Consul clusters. They even admitted as much.

1) They were using Consul for multiple workloads -- as both their service discovery layer and as a high-performance KV store. These should have been isolated.

2) They introduced high read _and_ write load without monitoring the underlying Raft performance. Essentially, they exceeded the operating parameters of the protocol/BoltDB.

Well, and they turned on a new feature without proper testing. There's a difference between doing something stupid and the services being inherently flawed. Kubernetes has just as many footguns to shoot yourself with, if not more.

As for your etcd comment, I know several people at different companies whose literal full-time job has devolved into tuning etcd for their k8s clusters.

Ask anyone who worked at Squarespace 2-3 years ago. They had a full team of 30+ devops folks dedicated full-time to Kubernetes, and their production clusters were failing almost _daily_ for a full year.

k8s and Nomad are both complex systems. My point is that Nomad isn’t some sort of magic bullet that cures the k8s illness. I can’t really speak to k8s 2-3 years ago because I wasn’t working with it. It has been a dream for me and my teams these days. But it still has its issues. Like anything else.

Transitioning from k8s to Nomad for simplicity reasons doesn’t make a lick of sense. Nomad is going to fail in _at least_ the same ways, and there is going to be far less information out there on how to fix it.