Comment by rdtsc
24 days ago
> By default, NATS only flushes data to disk every two minutes, but acknowledges operations immediately. This approach can lead to the loss of committed writes when several nodes experience a power failure, kernel crash, or hardware fault concurrently—or in rapid succession (#7564).
I am getting strong early MongoDB vibes. "Look how fast it is, it's web-scale!". Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.
Coordinated failures shouldn't be a novelty or a surprise any longer these days.
I wouldn't trust a product that doesn't default to safest options. It's fine to provide relaxed modes of consistency and durability but just don't make them default. Let the user configure those themselves.
I don't think there is a modern database that has the safest options all turned on by default. For instance, the default transaction isolation level for PG is Read Committed, not Serializable.
One of the most used DBs in the world is Redis, and by default it fsyncs every second, not on every operation.
SQLite is always serializable and by default has synchronous=FULL, so it fsyncs on every commit.
The problem is it has terrible defaults for performance (in the context of web servers). They're just bad legacy options, not ones that make it less robust: a ridiculously small cache size, temp tables not in memory, WAL off so no concurrent reads/writes, etc.
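For context, the kind of tuning being described looks roughly like this (the exact values here are illustrative, not official recommendations):

```python
import os
import sqlite3
import tempfile

# WAL mode needs a real file; it does not work on :memory: databases.
path = os.path.join(tempfile.mkdtemp(), "app.db")
conn = sqlite3.connect(path)

# Common overrides for web-server workloads. These address the legacy
# defaults mentioned above without weakening durability:
conn.execute("PRAGMA journal_mode = WAL")    # concurrent readers + one writer
conn.execute("PRAGMA cache_size = -64000")   # ~64 MiB page cache (negative = KiB)
conn.execute("PRAGMA temp_store = MEMORY")   # temp tables/indices in RAM
conn.execute("PRAGMA busy_timeout = 5000")   # wait up to 5s instead of failing fast

print(conn.execute("PRAGMA journal_mode").fetchone()[0])  # -> wal
```

None of these pragmas change SQLite's durability story; synchronous stays at FULL unless you lower it yourself.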
CockroachDB is serializable by default, but I don’t know about their other settings.
FoundationDB provides strict serializability by default.
Pretty sure SQL Server won't acknowledge a write until it's in the WAL (you can go the opposite way and turn on delayed durability, though).
I don't know about Jetstream, but redis cluster would only ack writes after replicating to a majority of nodes. I think there is some config on standalone redis too where you can ack after fsync (which apparently still doesn't guarantee anything because of buffering in the OS). In any case, understanding what the ack implies is important, and I'd be frustrated if jetstream docs were not clear on that.
At least per the Redis docs, clusters acknowledge writes before they're replicated: https://redis.io/docs/latest/operate/oss_and_stack/managemen...
The docs explicitly state that clusters do not provide strong consistency and can lose acknowledged data.
To the best of my knowledge, Redis has never blocked for replication, although you can configure healthy replication state as a prerequisite to accept writes.
NATS is very upfront in that the only thing that is guaranteed is the cluster being up.
I like that, and it allows me to build things around it.
For us when we used it back in 2018, it performed well and was easy to administer. The multi-language APIs were also good.
> NATS is very upfront in that the only thing that is guaranteed is the cluster being up.
Not so fast.
Their docs make some pretty bold claims about JetStream...
They talk about JetStream addressing the "fragility" of other streaming technology.
And "This functionality enables a different quality of service for your NATS messages, and enables fault-tolerant and high-availability configurations."
And one of their big selling points for JetStream is the whole "store and replay" thing. Which implies the storage bit should be trustworthy, no?
Oh sorry, I was talking about NATS Core, not JetStream. I'd be pretty sceptical about persistence.
3 replies →
> Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.
The trouble is that you need to specifically optimize for fsyncs, because usually it is either no brakes or hand-brake.
The middle-ground of multi-transaction group-commit fsync seems to not exist anymore because of SSDs and massive IOPS you can pull off in general, but now it is about syscall context switches.
Two minutes is a bit too much (also, fdatasync vs fsync).
IOPS only solves throughput, not latency. You still need to saturate internal parallelism to get good throughput from SSDs, and that requires batching. Also, even double-digit microsecond write latency per transaction commit would limit you to only 10K TPS. It's just not feasible to issue individual synchronous writes for every transaction commit, even on NVMe.
tl;dr "multi-transaction group-commit fsync" is alive and well
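For the record, a toy sketch of group commit (the `GroupCommitLog` class here is hypothetical, not any particular database's implementation): many writers queue records, a single background thread does one write+fsync per batch, and only then acknowledges every transaction in that batch.

```python
import os
import queue
import tempfile
import threading

class GroupCommitLog:
    """Toy group-commit log. Illustrative only: a real WAL adds
    framing, checksums, rotation, and error handling."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.q = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, record: bytes):
        done = threading.Event()
        self.q.put((record, done))
        done.wait()  # returns only after the record is fsync'ed

    def _writer(self):
        while True:
            batch = [self.q.get()]        # block for the first record...
            while not self.q.empty():     # ...then drain whatever piled up
                batch.append(self.q.get())
            os.write(self.fd, b"".join(r + b"\n" for r, _ in batch))
            os.fsync(self.fd)             # one fsync covers the whole batch
            for _, done in batch:
                done.set()                # ack everyone in the batch

path = os.path.join(tempfile.mkdtemp(), "wal.log")
log = GroupCommitLog(path)
log.commit(b"txn-1")
log.commit(b"txn-2")
```

The latency cost per transaction is at most one fsync interval, but under load many commits amortize a single fsync, which is exactly the middle ground between "no brakes" and "hand-brake".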
Not flushing on every write is a very common tradeoff of speed over durability. Filesystems, databases, all kinds of systems do this. They have some hacks to prevent it from corrupting the entire dataset, but lost writes are accepted. You can often prevent this by enabling an option or tuning a parameter.
> I wouldn't trust a product that doesn't default to safest options
This would make most products suck, and require a crap-ton of manual fixes and tuning that most people would hate, if they even got the tuning right. You have to actually do some work yourself to make a system behave the way you require.
For example, Postgres' isolation level is weak by default, leading to race conditions. You have to explicitly enable serialization to avoid it, which is a performance penalty. (https://martin.kleppmann.com/2014/11/25/hermitage-testing-th...)
> Filesystems, databases, all kinds of systems do this. They have some hacks to prevent it from corrupting the entire dataset, but lost writes are accepted.
Woah, those are _really_ strong claims. "Lost writes are accepted"? Assuming we are talking about "acknowledged writes", which the article is discussing, I don't think it's true that this is a common default for databases and filesystems. Perhaps databases or K/V stores that are marketed as in-memory caches might have defaults like this, but I'm not familiar with other systems that do.
I'm also getting MongoDB vibes from deciding not to flush except once every two minutes. Even deciding to wait a second would be pretty long, but two minutes? A lot happens in a busy system in 120 seconds...
No filesystem I'm aware of syncs to disk on every write by default, and you absolutely can lose data. You have to request the sync intentionally. And even then the disk can still lose the writes.
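Concretely, with POSIX-style file I/O (Python used here just for illustration), a write() only lands in the kernel's page cache; you have to ask for durability explicitly:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")
fd = os.open(path, os.O_WRONLY | os.O_CREAT)

os.write(fd, b"important record\n")  # in the page cache, NOT yet on disk;
                                     # a power cut here can lose it
os.fsync(fd)                         # ask kernel + disk to persist the data
os.close(fd)

# For a freshly created file, the directory entry needs syncing too,
# or the file itself can vanish after a crash:
dfd = os.open(os.path.dirname(path), os.O_RDONLY)
os.fsync(dfd)
os.close(dfd)
```

Even after fsync returns, drives with volatile write caches can lie unless the cache is flushed or disabled, which is the "even then the disk can still lose the writes" caveat above.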
Most (all?) NoSQL solutions are also eventually consistent by default, which means they can lose data. That's how Mongo works: it syncs a journal every 30-100 ms, and it syncs full writes at a configurable delay. Mongo is terrible, but not because it behaves like a filesystem.
Note that this is not "bad", it's just different. Lots of people use these systems specifically because they need performance more than durability. There are other systems you can use if you need those guarantees.
1 reply →
I think “most people will have to turn on the setting to make things fast at the expense of durability” is a dubious assertion (plenty of systems, even high-criticality ones, do not have a very high data rate and thus would not necessarily suffer unduly from e.g. fsync-every-write).
Even if most users do turn out to want “fast_and_dangerous = true”, that’s not a particularly onerous burden to place on users: flip one setting, and hopefully learn from the setting name or the documentation consulted when learning about it that it poses operational risk.
I always think about the way you discover the problem. I used to say the same about RNG: if you need fast PRNG and you pick CSPRNG, you’ll find out when you profile your application because it isn’t fast enough. In the reverse case, you’ll find out when someone successfully guesses your private key.
If you need performance and you pick data integrity, you find out when your latency gets too high. In the reverse case, you find out when a customer asks where all their data went.
In the defense of PG, for better or worse as far as I know, the 'what is RDBMS default' falls into two categories;
- Read Committed default with MVCC (Oracle, Postgres, Firebird versions with MVCC, I -think- SQLite with WAL falls under this)
- Read committed with write locks one way or another (MSSQL default, SQLite default, Firebird pre MVCC, probably Sybase given MSSQL's lineage...)
I'm not aware of any RDBMS that treats 'serializable' as the default transaction level OOTB (I'd love to learn though!)
....
All of that said, 'Inconsistent read because you don't know RDBMS and did not pay attention to the transaction model' has a very different blame direction than 'We YOLO fsync on a timer to improve throughput'.
If anything, it scares me that there are no other tuning options involved, such as a number of bytes or a number of events.
If I get a write-ack from a middleware I expect it to be written one way or another. Not 'It is written within X seconds'.
AFAIK there's no RDBMS that will just 'lose a write' unless the disk happens to be corrupted (or, IDK, maybe someone YOLOing with chaos mode on DB2?)
CockroachDB does Serializable by default
> I -think- SQLite with WAL falls under this
No. SQLite is serializable. There's no configuration where you'd get read committed or repeatable read.
In WAL mode you may read stale data (depending on how you define stale data), but if you try to write in a transaction that has read stale data, you get a conflict, and need to restart your transaction.
There's one obscure configuration no one uses that's read uncommitted. But really, no one uses it.
> NATS only flushes data to disk every two minutes, but acknowledges operations immediately.
Wait, isn't that the whole point of acknowledgments? This is not an acknowledgment, it's "418 I'm a teapot".
Exactly, it's a teapot. And my point was, it's fine to let the user configure that, but shipping it as the default seems fishy. It looks good in benchmarks, so that's why they do it, just like MongoDB did initially.
NATS data is ephemeral in many cases anyhow, so it makes a bit more sense here. If you wanted something fully durable with a stronger persistence story you'd probably use Kafka anyhow.
Core NATS is ephemeral. JetStream is meant to be persistent, and is presented as a replacement for Kafka.
> NATS data is ephemeral in many cases anyhow, so it makes a bit more sense here
Dude ... the guy was testing JetStream.
Which I quote, from the first sentence of the first paragraph on the NATS website:
So is MQTT, why bother with NATS then?
MQTT doesn't have the same semantics. https://docs.nats.io/nats-concepts/core-nats/reqreply request-reply is really useful if you need low latency but reasonably efficient queuing (make sure to mark your workers as busy while processing, otherwise you get latency spikes).
2 replies →