Comment by belter
23 days ago
The only post in this thread that actually summarized the core findings of the study, namely:
- ACKed messages can be silently lost due to minority-node corruption.
- A single-bit corruption can cause some replicas to lose up to 78% of stored messages.
- Snapshot corruption can propagate and lead to entire stream deletion across the cluster.
- The default lazy-fsync mode can drop minutes of acknowledged writes on a crash.
- A crash combined with network delay can cause persistent split-brain and divergent logs.
- Data loss can occur even with “sync_interval = always” in the presence of membership changes or partitions.
- Self-healing and replica convergence did not always work reliably after corruption.
…was not downvoted, but flagged... That is telling. Documented failure modes are apparently controversial. It also raises the question: What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?
So what is next? Nominate NATS for the Silent Failure Peace Prize?
> Nominate NATS for the Silent Failure Peace Prize?
One or two of the NATS team's GitHub comments in response to issues opened by Kyle are also more than a bit cringeworthy.
Such as this one:
"Most of our production setups, and in fact Synadia Cloud as well is that each replica is in a separate AZ. These have separate power, networking etc. So the possibility of a loss here is extremely low in terms of due to power outages."
Which Kyle had to call them out on:
"Ah, I have some bad news here--placing nodes in separate AZs does not mean that NATS' strategy of not syncing things to disk is safe. See #7567 for an example of a single node failure causing data loss (and split-brain!)."
https://github.com/nats-io/nats-server/issues/7564#issuecomm...
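For anyone wondering why acknowledging before fsync matters at all, here is a toy sketch of the failure mode. It is not NATS code: `ToyLog`, `lost_after_crash`, and the `sync_always` flag are made-up names for illustration, and it models a single node only, with "crash" simply meaning that anything not yet synced is discarded.

```python
class ToyLog:
    """Toy model of a log with lazy or eager sync. Hypothetical, not NATS code."""

    def __init__(self, sync_always):
        self.sync_always = sync_always
        self.page_cache = []  # written, but not yet fsynced
        self.disk = []        # survives a crash
        self.acked = []       # messages the client was told are stored

    def append(self, msg):
        self.page_cache.append(msg)
        if self.sync_always:
            self.sync()       # make the write durable before acknowledging
        self.acked.append(msg)

    def sync(self):
        # Stand-in for fsync: move buffered writes to durable storage.
        self.disk.extend(self.page_cache)
        self.page_cache.clear()

    def crash(self):
        # Anything still in the page cache is lost.
        self.page_cache.clear()


def lost_after_crash(sync_always, writes=5):
    """Return the acknowledged messages that are missing after a crash."""
    log = ToyLog(sync_always)
    for i in range(writes):
        log.append(i)
    log.crash()
    return [m for m in log.acked if m not in log.disk]


print(lost_after_crash(sync_always=False))  # every acked write is gone
print(lost_after_crash(sync_always=True))   # nothing acked is lost
```

In the lazy case the window of loss is whatever was acknowledged since the last sync, which is why the length of the sync interval matters so much.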
> What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?
I have to note the following as a NATS fan:
All of that is a way to say, I'd bet a lot of them are using Core NATS or other specific features versus something like JetStream.
tl;dr - JetStream's reliability is apparently horrifying, but I stand by the statement that Core NATS and Ephemeral KV are amazing.