Jepsen: NATS 2.12.1

24 days ago (jepsen.io)

Every time someone builds one of these things and skips over "overcomplicated theory", aphyr destroys them. At this point, I wonder if we could train an AI to look over a project's documentation, and predict whether it's likely to lose committed writes just based on the marketing / technical claims. We probably can.

  • /me strokes my long grey beard and nods

    People always think "theory is overrated" or "hacking is better than having a school education"

    And then proceed to shoot themselves in the foot with "workarounds" that break well known, well documented, well traversed problem spaces

  • The only post in this thread that actually summarized the core findings of the study, namely:

    - ACKed messages can be silently lost due to minority-node corruption.

    - A single-bit corruption can cause some replicas to lose up to 78% of stored messages

    - Snapshot corruption can propagate and lead to entire stream deletion across the cluster.

    - The default lazy-fsync mode can drop minutes of acknowledged writes on a crash.

    - A crash combined with network delay can cause persistent split-brain and divergent logs.

    - Data loss even with “sync_interval = always” in presence of membership changes or partitions.

    - Self-healing and replica convergence did not always work reliably after corruption.

    …was not downvoted, but flagged... That is telling. Documented failure modes are apparently controversial. Also raises the question: What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?

    So what is next? Nominate NATS for the Silent Failure Peace Prize?

    • > Nominate NATS for the Silent Failure Peace Prize?

      One or two of the comments on GitHub by the NATS team in response to Issues opened by Kyle are also more than a bit cringeworthy.

      Such as this one:

      "Most of our production setups, and in fact Synadia Cloud as well is that each replica is in a separate AZ. These have separate power, networking etc. So the possibility of a loss here is extremely low in terms of due to power outages."

      Which Kyle had to call them out on:

      "Ah, I have some bad news here--placing nodes in separate AZs does not mean that NATS' strategy of not syncing things to disk is safe. See #7567 for an example of a single node failure causing data loss (and split-brain!)."

      https://github.com/nats-io/nats-server/issues/7564#issuecomm...

    • > What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?

      I have to note the following as a NATS fan:

        - I am horrified at Jepsen's reliability findings, however they do vindicate certain design decisions I made in the past
      
        - 'Core NATS' is really mostly 'redis pubsub but better' and Core NATS is honestly awesome, low friction middleware. I've used it as part of eventing systems in the past and it works great.
      
        - FWIW, There's an MQTT bridge that requires Jetstream, but if you're just doing QoS 0 you can work around the other warts.
      
        - If you use Jetstream KV as a cache layer without real persistence (i.e. closer to how one uses Redis KV where it's just memory backed) you don't care about any of this. And again Jetstream KV IMO is better than Redis KV since they added TTL.
      

      All of that is a way to say, I'd bet a lot of them are using Core NATS or other specific features versus something like JetStream.

      tl;dr - Jetstream's reliability is horrifying apparently but I stand by the statement that Core NATS and Ephemeral KV is amazing.

  • You can have DeepWiki literally scan the source code and tell you:

    > 2. Delayed Sync Mode (Default)

    > In the default mode, writes are batched and marked with needSync = true for later synchronization filestore.go:7093-7097 . The actual sync happens during the next syncBlocks() execution.

    However, if you read DeepWiki's conclusion, it is far more optimistic than what Aphyr uncovered in real-world testing.

    > Durability Guarantees

    > Even with delayed fsyncs, NATS provides protection against data loss through:

    > 1. Write-Ahead Logging: Messages are written to log files before being acknowledged

    > 2. Periodic Sync: The sync timer ensures data is eventually flushed to disk

    > 3. State Snapshots: Full state is periodically written to index.db files filestore.go:9834-9850

    > 4. Error Handling: If sync operations fail, NATS attempts to rebuild state from existing data filestore.go:7066-7072

    https://deepwiki.com/search/will-nats-lose-uncommitted-wri_b...

  • It's not even "overcomplicated theory" it's just "commit your writes before you say you committed your writes". It's actually way, way more complicated to try to build a system that tries to be correct without doing that.

  • You don’t even have to train an AI. At this point, in lieu of evidence to the contrary, we should default to “it loses committed writes”.

For anyone dealing with databases, and especially distributed databases, I highly recommend reading the Jepsen page on consistency models: https://jepsen.io/consistency/models

It provides a dictionary of terms that we can use to have educated discussions, rather than throwing around terms like "ACID".

Wow. I’ve used NATS for best-effort in-memory pub/sub, which it has been great for, including getting subtle scaling details right. I never touched their persistence and would have investigated more before I did, but I wouldn’t have expected it to be this bad. Vulnerability to simple single-bit file corruption is embarrassing.

> 3.4 Lazy fsync by Default

Why? Why do some databases do that? To have better performance in benchmarks? It would be more acceptable with a safer default, or at least with prominent documentation of the behavior. Especially when you run stuff in a small cluster, you get bitten by things like that.

  • It's not just better performance on latency benchmarks, it likely improves throughput as well because the writes will be batched together.

    Many applications do not require true durability and it is likely that many applications benefit from lazy fsync. Whether it should be the default is a lot more questionable though.

    • It’s like using a non-cryptographically secure RNG: if you don’t know enough to look for the fsync flag off yourself, it’s unlikely you know enough to evaluate the impact of durability on your application.

    • You can batch writes while at the same time not acknowledging them to clients until they are flushed, it just takes more bookkeeping.

    • I also think fsync before acking writes is a better default. That aside, if you were to choose async for batching writes, their default value surprises me. 2 minutes seems like an eternity. Would you not get very good batching for throughput even at something like 2 seconds too? Still not safe, but safer.

    • For transactional durability, the writes will definitely be batched ("group commit"), because otherwise throughput would collapse.

  • I always wondered why the fsync has to be lazy. It seems like the fsyncs can be bundled up together, and the notification messages held for a few millis while the write completes. Similar to TCP corking. There doesn't need to be one fsync per consensus.

    • That was my immediate thought as well, under the assumption the lazy fsync is for performance. I imagine in some situations, delaying the confirmation until the write actually happens is okay (depending on the delay). But it also occurred to me that if the delay is long enough, the system busy enough, and the time to send a message short enough, the number of connections you need to keep open can be some small or large multiple of what you would need without holding the confirmation until the actual write.

    • In practice, there must be a delay (from batching) if you fsync every transaction before acknowledging commit. The database would be unusably slow otherwise.

  • One of the perks of being distributed, I guess.

    The kind of failure that a system can tolerate with strict fsync but can't tolerate with lazy fsync (i.e. the software 'confirms' a write to its caller but then crashes) is probably not the kind of failure you'd expect to encounter on a majority of your nodes all at the same time.

    • It is if they’re in the same physical datacenter. Usually the way this is done is to wait for at least M replicas to fsync, but only require the data to be in memory for the rest. It smooths out the tail latencies, which are quite high for SSDs.
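The acknowledgment rule described in that reply (at least M replicas fsynced, the rest only holding the write in memory) can be sketched as a small predicate. This is a minimal illustration in Python, not how NATS or any particular system implements it; `ReplicaState`, `can_ack`, and both thresholds are hypothetical names chosen for the example.

```python
from enum import Enum


class ReplicaState(Enum):
    MISSING = 0    # replica has not received the write
    IN_MEMORY = 1  # replica holds the write in memory only
    FSYNCED = 2    # replica has flushed the write to stable storage


def can_ack(states, min_fsynced, min_received):
    """Return True once enough replicas hold the write.

    min_fsynced:  replicas that must have fsynced (the "M" in the comment).
    min_received: replicas that must hold the write at all, in memory or on disk.
    """
    fsynced = sum(1 for s in states if s is ReplicaState.FSYNCED)
    received = sum(1 for s in states if s is not ReplicaState.MISSING)
    return fsynced >= min_fsynced and received >= min_received
```

With five replicas, `can_ack(states, min_fsynced=2, min_received=5)` holds the acknowledgment until two replicas have the write on stable storage, while the slowest disks stay off the critical path.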

  • durability through replication and distribution and better throughput to build up more within the window on a lazy fsync

Curious about the differences between content on aphyr.com/tags/jepsen and jepsen.io/analyses. I recently discovered aphyr.com and was excited about the potential insights!

  • Jepsen started as a personal blog series in nights and weekends; jepsen.io is when I started doing it professionally, about ten years ago.

    • Curious: do you have a team of people working with you, or is it mostly solo work? Your work is so valuable, I would be scared for our industry if it had a bus factor of 1.

> > You can force an fsync after each messsage [sic] with always, this will slow down the throughput to a few hundred msg/s.

Is the performance warning in the NATS docs possible to improve on? Couldn't you still run fsync on an interval and queue up a certain number of writes to be flushed at once? I could imagine latency suffering, but batched throughput could be preserved to some extent?

  • > Is the performance warning in the NATS docs possible to improve on? Couldn't you still run fsync on an interval and queue up a certain number of writes to be flushed at once? I could imagine latency suffering, but batched throughput could be preserved to some extent?

    Yes, and you shouldn't even need a fixed interval. Just queue up any writes while an `fsync` is pending; then do all those in the next batch. This is the same approach you'd use for rounds of Paxos, particularly between availability zones or regions where latency is expected to be high. You wouldn't say "oh, I'll ack and then put it in the next round of Paxos", or "I'll wait until the next round in 2 seconds then ack"; you'd start the next batch as soon as the current one is done.
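The approach in that reply (queue writes while an `fsync` is pending, flush them all in the next batch, acknowledge only after the flush) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not NATS code: `GroupCommitLog`, `demo`, and the file name are hypothetical, and a production system would treat a failed `fsync` as fatal rather than carrying on.

```python
import os
import tempfile
import threading
from concurrent.futures import Future


class GroupCommitLog:
    """Append-only log that resolves a write's future only after fsync."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._pending = []      # (payload, future) pairs waiting for the next batch
        self._cond = threading.Condition()
        self._closed = False
        self.fsync_count = 0    # instrumentation for the demo only
        self._flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self._flusher.start()

    def append(self, payload):
        """Queue payload; the returned future completes once it is fsynced."""
        fut = Future()
        with self._cond:
            if self._closed:
                raise RuntimeError("log is closed")
            self._pending.append((payload, fut))
            self._cond.notify()
        return fut

    def _flush_loop(self):
        while True:
            with self._cond:
                while not self._pending and not self._closed:
                    self._cond.wait()
                if not self._pending:
                    return      # closed and fully drained
                # Take everything queued so far; writes arriving during the
                # fsync below accumulate for the next batch.
                batch, self._pending = self._pending, []
            try:
                view = memoryview(b"".join(p for p, _ in batch))
                while view:     # os.write may write fewer bytes than asked
                    view = view[os.write(self._fd, view):]
                os.fsync(self._fd)
                self.fsync_count += 1
            except OSError as exc:
                for _, fut in batch:
                    fut.set_exception(exc)
                continue
            for _, fut in batch:
                fut.set_result(None)    # acknowledge only after the fsync

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._flusher.join()
        os.close(self._fd)


def demo(writers=8, per_writer=25):
    """Run concurrent writers; return (lines on disk, number of fsyncs)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        log = GroupCommitLog(path)

        def writer(i):
            for j in range(per_writer):
                log.append(f"{i}:{j}\n".encode()).result()  # block until durable

        threads = [threading.Thread(target=writer, args=(i,)) for i in range(writers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        log.close()
        with open(path, "rb") as f:
            return len(f.read().splitlines()), log.fsync_count
```

There is no timer: the batch size adapts to load, because whatever arrives during one `fsync` becomes the next batch. How many `fsync` calls `demo()` ends up making depends on the disk and on thread scheduling.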

  • Yes, this is a reasonably common strategy. It's how Cassandra's batch and group commit modes work, and Postgres has a similar option. Hopefully NATS will implement something similar eventually.

NATS is a fantastic piece of software. But the docs are impractical and half-baked. It's a shame to have to reverse-engineer the software from GitHub to learn the auth schemes.

Thanks, those reports are always a quiet pleasure to read even if one is a bit far from the domain.

> By default, NATS only flushes data to disk every two minutes, but acknowledges operations immediately. This approach can lead to the loss of committed writes when several nodes experience a power failure, kernel crash, or hardware fault concurrently—or in rapid succession (#7564).

I am getting strong early MongoDB vibes. "Look how fast it is, it's web-scale!". Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.

Coordinated failures shouldn't be a novelty or a surprise any longer these days.

I wouldn't trust a product that doesn't default to safest options. It's fine to provide relaxed modes of consistency and durability but just don't make them default. Let the user configure those themselves.

  • I don't think there is a modern database that has the safest options all turned on by default. For instance, the default transaction isolation level for PG is read committed, not serializable.

    One of the most used DBs in the world is Redis, and by default it fsyncs every second, not on every operation.

    • SQLite is always serializable and by default has synchronous=FULL, so it fsyncs on every commit.

      The problem is that it has terrible defaults for performance (in the context of web servers): just bad legacy options, not ones that make it less robust. E.g. the cache size is ridiculously small, temp tables are not in memory, and WAL is off, so there are no concurrent reads/writes.
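The settings that reply mentions map onto SQLite pragmas. Below is a minimal sketch using Python's built-in `sqlite3` module; the helper names (`open_tuned`, `show_settings`), the file name, and the 64 MiB cache size are assumptions for illustration, not recommendations.

```python
import os
import sqlite3
import tempfile


def open_tuned(path):
    """Open a SQLite database with durable commits and the settings named above."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")     # readers no longer block the writer
    conn.execute("PRAGMA synchronous=FULL")     # sync the WAL on every commit
    conn.execute("PRAGMA cache_size=-65536")    # negative value = size in KiB (64 MiB)
    conn.execute("PRAGMA temp_store=MEMORY")    # keep temp tables and indices in RAM
    return conn


def show_settings():
    """Return (journal_mode, synchronous) as reported by a freshly tuned database."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_tuned(os.path.join(tmp, "app.db"))
        try:
            mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
            sync = conn.execute("PRAGMA synchronous").fetchone()[0]
            return mode, sync
        finally:
            conn.close()
```

Only the cache and temp-store pragmas trade memory for speed; `synchronous=FULL` is kept, so the durability the reply praises is unchanged.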

    • CockroachDB is serializable by default, but I don’t know about their other settings.

    • Pretty sure SQL Server won't acknowledge a write until it's in the WAL (you can go the opposite way and turn on delayed durability, though).

  • I don't know about Jetstream, but redis cluster would only ack writes after replicating to a majority of nodes. I think there is some config on standalone redis too where you can ack after fsync (which apparently still doesn't guarantee anything because of buffering in the OS). In any case, understanding what the ack implies is important, and I'd be frustrated if jetstream docs were not clear on that.

  • NATS is very upfront in that the only thing that is guaranteed is the cluster being up.

    I like that, and it allows me to build things around it.

    For us when we used it back in 2018, it performed well and was easy to administer. The multi-language APIs were also good.

    • > NATS is very upfront in that the only thing that is guaranteed is the cluster being up.

      Not so fast.

      Their docs make some pretty bold claims about JetStream....

      They talk about JetStream addressing the "fragility" of other streaming technology.

      And "This functionality enables a different quality of service for your NATS messages, and enables fault-tolerant and high-availability configurations."

      And one of their big selling points for JetStream is the whole "store and replay" thing. Which implies the storage bit should be trustworthy, no?

  • > Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.

    The trouble is that you need to specifically optimize for fsyncs, because usually it is either no brakes or hand-brake.

    The middle-ground of multi-transaction group-commit fsync seems to not exist anymore because of SSDs and massive IOPS you can pull off in general, but now it is about syscall context switches.

    Two minutes is a bit too much (also fdatasync vs fsync).

    • IOPS only solves throughput, not latency. You still need to saturate internal parallelism to get good throughput from SSDs, and that requires batching. Also, even double-digit microsecond write latency per transaction commit would limit you to only 10K TPS. It's just not feasible to issue individual synchronous writes for every transaction commit, even on NVMe.

      tl;dr "multi-transaction group-commit fsync" is alive and well
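The arithmetic behind that reply can be made concrete. A committer that waits for its own synchronous write before starting the next one is bounded by one commit per write latency, while a group commit amortizes the same latency over a batch. The sketch below is a back-of-the-envelope model only: the function names are hypothetical, and it assumes the sync latency does not grow with batch size, which real devices only approximate.

```python
def serial_commit_tps(commit_latency_us):
    """Upper bound on commits/second when each commit waits for its own sync."""
    return 1_000_000 / commit_latency_us


def group_commit_tps(commit_latency_us, batch_size):
    """Upper bound when batch_size commits share one sync of the same latency."""
    return batch_size * serial_commit_tps(commit_latency_us)
```

At roughly 100 µs per sync, `serial_commit_tps(100)` gives 10,000 commits per second; sharing each sync across a hypothetical batch of 64 raises the ceiling to 640,000 under the same assumption.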

  • Not flushing on every write is a very common tradeoff of speed over durability. Filesystems, databases, all kinds of systems do this. They have some hacks to prevent it from corrupting the entire dataset, but lost writes are accepted. You can often prevent this by enabling an option or tuning a parameter.

    > I wouldn't trust a product that doesn't default to safest options

    This would make most products suck, and require a crap-ton of manual fixes and tuning that most people would hate, if they even got the tuning right. You have to actually do some work yourself to make a system behave the way you require.

    For example, Postgres' isolation level is weak by default, leading to race conditions. You have to explicitly enable serialization to avoid it, which is a performance penalty. (https://martin.kleppmann.com/2014/11/25/hermitage-testing-th...)

    • > Filesystems, databases, all kinds of systems do this. They have some hacks to prevent it from corrupting the entire dataset, but lost writes are accepted.

      Woah, those are _really_ strong claims. "Lost writes are accepted"? Assuming we are talking about "acknowledged writes", which the article is discussing, I don't think it's true that this is a common default for databases and filesystems. Perhaps databases or K/V stores that are marketed as in-memory caches might have defaults like this, but I'm not familiar with other systems that do.

      I'm also getting MongoDB vibes from deciding not to flush except once every two minutes. Even deciding to wait a second would be pretty long, but two minutes? A lot happens in a busy system in 120 seconds...

    • I think “most people will have to turn on the setting to make things fast at the expense of durability” is a dubious assertion (plenty of systems, even high-criticality ones, do not have a very high data rate and thus would not necessarily suffer unduly from e.g. fsync-every-write).

      Even if most users do turn out to want “fast_and_dangerous = true”, that’s not a particularly onerous burden to place on users: flip one setting, and hopefully learn from the setting name or the documentation consulted when learning about it that it poses operational risk.

    • In the defense of PG, for better or worse, as far as I know the 'what is the RDBMS default' question falls into two categories:

      - Read Committed default with MVCC (Oracle, Postgres, Firebird versions with MVCC, I -think- SQLite with WAL falls under this)

      - Read committed with write locks one way or another (MSSQL default, SQLite default, Firebird pre MVCC, probably Sybase given MSSQL's lineage...)

      I'm not aware of any RDBMS that treats 'serializable' as the default transaction level OOTB (I'd love to learn though!)

      ....

      All of that said, 'Inconsistent read because you don't know RDBMS and did not pay attention to the transaction model' has a very different blame direction than 'We YOLO fsync on a timer to improve throughput'.

      If anything it scares me that there's no other tuning options involved such as number of bytes or number of events.

      If I get a write-ack from a middleware I expect it to be written one way or another. Not 'It is written within X seconds'.

      AFAIK there's no RDBMS that will just 'lose a write' unless the disk happens to be corrupted (or, IDK, maybe someone YOLOing with chaos mode on DB2?)

  • > NATS only flushes data to disk every two minutes, but acknowledges operations immediately.

    Wait, isn't that the whole point of acknowledgments? This is not acknowledgment, it's I'm a teapot.

    • Exactly, it's a teapot. And my point was, it's fine to let the user configure that, but shipping it as a default seems fishy. It looks good in benchmarks, so that's why they do it, just like MongoDB did initially.

  • NATS data is ephemeral in many cases anyhow, so it makes a bit more sense here. If you wanted something fully durable with a stronger persistence story you'd probably use Kafka anyhow.

    • Core NATS is ephemeral. JetStream is meant to be persisted, and is presented as a replacement for Kafka.

    • > NATS data is ephemeral in many cases anyhow, so it makes a bit more sense here

      Dude ... the guy was testing JetStream.

      Which, I quote from the first phrase from the first paragraph on the NATS website:

          NATS has a built-in persistence engine called JetStream which enables messages to be stored and replayed at a later time.

https://github.com/williamstein/nats-bugs

  • For example, https://github.com/williamstein/nats-bugs/issues/5 links to a discussion I had with them about data loss, where they fundamentally don't understand that their incorrect defaults lead to data loss on the application side. It's weird.

    I got very deep into using NATS last year, and then realized the choices it makes for persistence are really surprising. Another horrible example is that server startup time is O(number of streams), with a big constant; this is extremely painful to hit in production.

    I ended up implementing from scratch something with the same functionality (for me as NATS server + Jetstream), but based on socket.io and sqlite. It works vastly better for my use cases, since socketio and sqlite are so mature.

This is absolutely shocking! Does Kafka fsync on every write?

  • No. Redpanda has made a lot of noise about this over the years [0], and Confluent's Jack Vanlightly has responded in a fair bit of detail [1].

    [0]: https://www.redpanda.com/blog/why-fsync-is-needed-for-data-s...

    [1]: https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...

    • I think all modern systems, even ScyllaDB, do batched commits rather than an fsync on every write; you need either throughput or durability, and both cannot exist together. The only thing Redpanda claims is that you have to replicate before the fsync, so your data is not lost if the node that took the write dies from a power failure. This is how Scylla and Cassandra work, if I am not wrong: even if a node dies before the batch fsync, replication is done from the memtable before the fsync, so the other nodes provide the durability and the data-loss claim no longer holds in a replicated setup. Single node? Obviously 100% data loss. But this is the trade-off between a high-TPS system and a durable single-node system. It comes down to how you want to operate.

If you are looking for a serverless alternative to JetStream, check out https://s2.dev

Pros: unlimited streams with the durability of object storage – JetStream can only do a few K topics

Cons: no consumer groups yet, it's on the agenda

NATS JetStream vs., say, Redis Streams: which one have people found easier to work with?

  • When I worked with bounded Redis streams a couple of years ago we had to implement our own backpressure mechanism which was quite tricky to get right.

    To implement backpressure without relying on out-of-band signals (distributed systems beware), you need a deep understanding of the entire Redis Streams architecture and of how the pending entries list, consumer groups, consumers, etc. work and interact, so that you do not lose data by overwriting yourself.

    Unbounded would have been fine if we could spill to disk and periodically clean up the data, but this is redis.

    Not sure if that has improved.

    • I don't have a direct comment to add, but after working on the fringes of streams a bit, they've worked as advertised, but the API surface area for them is full of cases where, as you say, you have to kind of internalize the full architecture to really understand what's going on. It can be a bit overwhelming.

Definitely thought this was about aviation for a moment.