Comment by debarshri · 12 hours ago

We do; let me check with my team and post it here.

There were many issues. Top of mind: after a DR drill wherein a VM was booted, the node did not join the cluster. Apart from that, a bunch of issues caused by etcd and Longhorn.

Another major one was the CNI stopped working for a particular node. Image garbage collection was another: we labelled the images, but it would still remove them from the node.
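
For what it's worth, the image GC watermarks are at least tunable - k3s passes extra kubelet flags through via kubelet-arg in its config file. A rough sketch with made-up threshold values (this only delays GC, it doesn't pin images):

    # /etc/rancher/k3s/config.yaml - raise kubelet's image GC watermarks
    # so images are only pruned when the disk is nearly full.
    # Threshold values here are illustrative, not recommendations.
    kubelet-arg:
      - "image-gc-high-threshold=95"   # start GC at 95% disk usage
      - "image-gc-low-threshold=90"    # GC frees space down to 90%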

A bunch of these kinds of issues, when our requirement is fairly straightforward. Therefore we are working towards a stripped-down version.

There is a lot of operational complexity in general, and most of us can do without it.

I've found a lot of issues come from a somewhat naive networking setup - which is encouraged by the "just yolo it" installation instructions in the documentation. If you want to start understanding what's going on, you'll end up in very weird corners very quickly. Also, if you don't want the API endpoints available to the world, the documentation is not much help.
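
In the absence of better docs, the kind of thing I mean is just firewalling the API port yourself. An untested Ansible sketch, assuming 10.0.10.0/24 is your internal network (substitute your own) and nothing else is managing your iptables rules:

    - name: Allow the k3s API (6443) only from the internal network
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        destination_port: "6443"
        source: 10.0.10.0/24   # hypothetical internal subnet
        jump: ACCEPT

    - name: Drop the k3s API from everywhere else
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        destination_port: "6443"
        jump: DROP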

I've found things more stable if you can give k3s a dedicated interface just for internal cluster communication. It can be a bridge interface on top of a VLAN interface - but not the VLAN interface itself, or some things will break in very interesting ways. Also, even when using IPv6, just stick with internal IPs and NAT everything - changing internal IP ranges later is no fun. Plus, if there's a chance you'd ever want dual stack, set it up with internal v6 addresses from the start, and just don't use the v6 addresses for now. There's also a lot of unintuitive behaviour around dual-stack networking - and lots of areas where the documentation is just plain wrong.
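
Concretely, that ends up looking roughly like this in /etc/rancher/k3s/config.yaml, assuming the default flannel backend. The option names are standard k3s; the bridge name and the ULA prefixes are made up for the example:

    # /etc/rancher/k3s/config.yaml (server node)
    node-ip: "10.0.10.11,fd00:10::11"           # v4 + internal (ULA) v6 for this node
    flannel-iface: "br-k3s"                     # the bridge on top of the VLAN, not the VLAN itself
    cluster-cidr: "10.42.0.0/16,fd00:42::/56"   # pod CIDRs, v4 + v6
    service-cidr: "10.43.0.0/16,fd00:43::/112"  # service CIDRs, v4 + v6

Workloads can keep using only the v4 addresses for now; the point is that the v6 ranges exist from day one, because retrofitting cluster and service CIDRs later essentially means rebuilding the cluster.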

I'm scripting our stuff with Ansible - one of the more useful realisations was that in some areas, changes which shouldn't break anything can lead to cluster communication being interrupted. That's a very interesting thing to deal with, especially when you can't pin it to the change that didn't touch anything nearby and therefore shouldn't be responsible. I've learned, and I now sprinkle checks in to make sure all members can still reach each other, so that at least when I break something with a change I immediately know why.
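
The checks are nothing fancy - roughly the task below, assuming an inventory group k3s_cluster and a per-host node_ip variable (both names made up for the example). Note wait_for only checks TCP, so this covers the supervisor (6443) and kubelet (10250) ports but not flannel's UDP VXLAN traffic:

    - name: Verify every other cluster member is still reachable
      ansible.builtin.wait_for:
        host: "{{ hostvars[item.0].node_ip }}"
        port: "{{ item.1 }}"
        timeout: 5
      loop: >-
        {{ groups['k3s_cluster'] | difference([inventory_hostname])
           | product([6443, 10250]) | list }}

Run on every host after each change, this fails the play on the node that lost connectivity, instead of leaving you to find out later from much stranger symptoms.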

Meanwhile, our architecture team - which has surely supported zero real-life k8s deployments - went with a no-vendor, on-premises deployment, claiming it was as easy as booting a VM. After two years, there are two apps running, and supposedly all future apps will be deployed on that cluster.

I cannot wait for the end of this month to leave that place.