Comment by PunchyHamster

6 hours ago

Reliability has a very weird curve, frankly.

Technically, a multi-node cluster with failover (or full-on active-active) will have far higher uptime than a single node.

Practically, getting a multi-node cluster (for any non-trivial workload) to work right, reliably, and fail over in every case is far more work and far more code (which can have more bugs), and even if you do everything right and test what you can, unexpected stuff can still kill it. For example, we recently had an uncorrectable memory error that happened to hit a Ceph daemon in just the wrong way: one of the OSDs misbehaved and bogged down the entire cluster...
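
To put rough numbers on the "weird curve": a back-of-the-envelope sketch, assuming node failures are independent and lumping everything the cluster adds (failover logic, shared storage, coordination) into one "clustering layer" with its own availability. The function names and all figures here are made up for illustration, not measurements.

```python
def parallel_availability(node_availability: float, nodes: int) -> float:
    """Textbook case: the service is up if at least one of N independent nodes is up."""
    return 1.0 - (1.0 - node_availability) ** nodes


def cluster_availability(node_availability: float, nodes: int,
                         clustering_availability: float) -> float:
    """Same, but the clustering layer itself must also be working.

    clustering_availability is a hypothetical lump figure for failover
    code, shared storage, and so on. It acts in series with the nodes.
    """
    return parallel_availability(node_availability, nodes) * clustering_availability


single = 0.999                                   # illustrative single-node availability
ideal = parallel_availability(single, 3)         # the "technically" number
flaky = cluster_availability(single, 3, 0.998)   # clustering layer worse than one node

print(f"single node:        {single:.9f}")
print(f"3 nodes, ideal:     {ideal:.9f}")
print(f"3 nodes, flaky HA:  {flaky:.9f}")
```

Under these assumptions the extra nodes only pay off while the clustering layer is more reliable than a single node. Once it is not, the cluster as a whole ends up below the single-node figure, however many nodes you add.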