Comment by belter

2 months ago

The only post in this thread that actually summarized the core findings of the study, namely:

- ACKed messages can be silently lost due to minority-node corruption.

- A single-bit corruption can cause some replicas to lose up to 78% of stored messages.

- Snapshot corruption can propagate and lead to entire stream deletion across the cluster.

- The default lazy-fsync mode can drop minutes of acknowledged writes on a crash.

- A crash combined with network delay can cause persistent split-brain and divergent logs.

- Data loss can occur even with “sync_interval = always” in the presence of membership changes or partitions.

- Self-healing and replica convergence did not always work reliably after corruption.

…was not downvoted, but flagged... That is telling. Documented failure modes are apparently controversial. It also raises the question: What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?

So what is next? Nominate NATS for the Silent Failure Peace Prize?

traceroute66 2 months ago

> Nominate NATS for the Silent Failure Peace Prize?

One or two of the comments on GitHub by the NATS team in response to Issues opened by Kyle are also more than a bit cringeworthy.

Such as this one:

"Most of our production setups, and in fact Synadia Cloud as well is that each replica is in a separate AZ. These have separate power, networking etc. So the possibility of a loss here is extremely low in terms of due to power outages."

Which Kyle had to call them out on:

"Ah, I have some bad news here--placing nodes in separate AZs does not mean that NATS' strategy of not syncing things to disk is safe. See #7567 for an example of a single node failure causing data loss (and split-brain!)."

https://github.com/nats-io/nats-server/issues/7564#issuecomm...
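To make the disagreement concrete, here is a minimal sketch of the difference between acknowledging a write before and after it has been fsynced. This is a toy append-only log, not how NATS or JetStream is implemented; the `AppendLog` class, its `sync_always` flag, and the `unsynced` counter are all hypothetical names invented for illustration.

```python
import os
import tempfile


class AppendLog:
    """Toy append-only log (hypothetical; not NATS/JetStream code)."""

    def __init__(self, path, sync_always):
        self.sync_always = sync_always
        # O_APPEND so every write lands at the end of the file.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.unsynced = 0  # records acked but not yet flushed to disk

    def append(self, record):
        """Write one record; returning True stands in for the client ack."""
        os.write(self.fd, record + b"\n")
        if self.sync_always:
            os.fsync(self.fd)   # durable before we ack
        else:
            self.unsynced += 1  # acked, but only in the OS page cache
        return True

    def sync(self):
        """What a background timer would do in a lazy-fsync design."""
        os.fsync(self.fd)
        self.unsynced = 0

    def close(self):
        os.close(self.fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        lazy = AppendLog(os.path.join(d, "lazy.log"), sync_always=False)
        safe = AppendLog(os.path.join(d, "safe.log"), sync_always=True)
        for i in range(3):
            lazy.append(b"msg-%d" % i)
            safe.append(b"msg-%d" % i)
        # Acked records a power loss right now could take with it.
        at_risk = (lazy.unsynced, safe.unsynced)
        lazy.sync()
        after_sync = lazy.unsynced
        lazy.close()
        safe.close()
        return at_risk, after_sync


if __name__ == "__main__":
    print(demo())
```

In the lazy variant every record between two `sync()` calls has been acknowledged to the client while still sitting in the page cache, which is the window a crash can erase. Spreading replicas across AZs only helps if the replicas do not lose that same window at the same time.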

to11mtm 2 months ago

> What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?

I have to note the following as a NATS fan:

  - I am horrified at Jepsen's reliability findings; however, they do vindicate certain design decisions I made in the past.

  - 'Core NATS' is really mostly 'Redis pubsub but better', and it is honestly awesome, low-friction middleware. I've used it as part of eventing systems in the past and it works great.

  - FWIW, there's an MQTT bridge that requires JetStream, but if you're just doing QoS 0 you can work around the other warts.

  - If you use JetStream KV as a cache layer without real persistence (i.e. closer to how one uses Redis KV, where it's just memory-backed), you don't care about any of this. And again, JetStream KV is IMO better than Redis KV since they added TTL.

All of that is a way to say, I'd bet a lot of them are using Core NATS or other specific features versus something like JetStream.

tl;dr - JetStream's reliability is apparently horrifying, but I stand by the statement that Core NATS and ephemeral KV are amazing.