Comment by PunchyHamster

5 days ago

> WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn’t like that: it doesn’t depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).

And then a bug crashes your database cluster all at once, and now instead of missing seconds you miss minutes, because some smartass thought "surely if I send the request to 5 nodes, some of it will land on disk in the reasonably near future?"

I love how this industry invents best practices that are actually good, and then people come up with badly researched reasons to just... not follow them.

> "surely if I send request to 5 nodes some of that will land on disk in reasonably near future?"

That would be asynchronous replication. But IIUC the author is instead advocating for a distributed log with synchronous quorum writes.
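
To make the distinction concrete, here is a minimal sketch of a synchronous quorum write (everything here, the names, the latencies, the 5-replica count, is invented for illustration): the caller only sees "committed" after a majority of replicas ack a durable write, unlike fire-and-forget async replication:

    package main

    import (
        "errors"
        "fmt"
        "math/rand"
        "time"
    )

    // replicaWrite simulates shipping a log entry to one replica and waiting
    // for the replica to durably write (fsync) it before acking.
    func replicaWrite(id int, entry string, acks chan<- int) {
        // Simulated network + fsync latency. A slow or dead replica may
        // never ack; that is exactly the failure a quorum tolerates.
        time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
        acks <- id
    }

    // quorumWrite sends entry to every replica but reports success only once
    // a majority have acked. Unlike fire-and-forget async replication, the
    // caller is not told "committed" until a quorum durably holds the entry.
    func quorumWrite(entry string, replicas int, timeout time.Duration) error {
        quorum := replicas/2 + 1
        acks := make(chan int, replicas)
        for i := 0; i < replicas; i++ {
            go replicaWrite(i, entry, acks)
        }
        deadline := time.After(timeout)
        for got := 0; got < quorum; {
            select {
            case id := <-acks:
                got++
                fmt.Printf("ack from replica %d (%d/%d)\n", id, got, quorum)
            case <-deadline:
                return errors.New("quorum not reached before timeout")
            }
        }
        return nil // durable on a majority; now it is safe to ack the client
    }

    func main() {
        if err := quorumWrite("txn-42", 5, time.Second); err != nil {
            fmt.Println("commit failed:", err)
            return
        }
        fmt.Println("committed: durable on a majority of 5 replicas")
    }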

  • But we know this is not actually robust, because storage and power failures tend to be correlated. The most recent Jepsen analysis highlights, yet again, why that's flawed thinking: https://jepsen.io/analyses/nats-2.12.1

    • The Aurora paper [0] goes into detail of correlated failures.

      > In Aurora, we have chosen a design point of tolerating (a) losing an entire AZ and one additional node (AZ+1) without losing data, and (b) losing an entire AZ without impacting the ability to write data. [..] With such a model, we can (a) lose a single AZ and one additional node (a failure of 3 nodes) without losing read availability, and (b) lose any two nodes, including a single AZ failure and maintain write availability.

      As for why this can be considered durable enough, section 2.2 gives an argument based on their MTTR (mean time to repair) for storage segments:

      > We would need to see two such failures in the same 10 second window plus a failure of an AZ not containing either of these two independent failures to lose quorum. At our observed failure rates, that’s sufficiently unlikely, even for the number of databases we manage for our customers.

      [0] https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
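
      The quorum arithmetic behind those claims is easy to check. A quick sketch using the paper's numbers (V = 6 copies, 2 per AZ across 3 AZs, write quorum Vw = 4, read quorum Vr = 3):

          package main

          import "fmt"

          func main() {
              // Aurora's parameters: V = 6 copies (2 per AZ, 3 AZs),
              // write quorum Vw = 4, read quorum Vr = 3.
              const V, Vw, Vr = 6, 4, 3

              // The two classic quorum safety conditions:
              fmt.Println("reads overlap latest write:", Vr+Vw > V) // 3+4 > 6
              fmt.Println("writes cannot split-brain: ", Vw > V/2)  // 4 > 3

              // "AZ+1": losing one AZ (2 copies) plus one more node leaves
              // V-3 = 3 copies, exactly a read quorum, so no data is lost.
              fmt.Println("read quorum after AZ+1:    ", V-3 >= Vr)
              // Losing any two nodes leaves 4 copies, still a write quorum.
              fmt.Println("write quorum after 2 down: ", V-2 >= Vw)
          }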

The biggest lie we’ve been told is that databases require global consistency and a global clock. Traditional databases still operate on Newtonian assumptions about absolute time, while the real world runs on Einstein’s relativity, where time is local and relative. You don’t need global order, and you don’t need a global clock.

  • You need a clock but you can have more than one. This is an important distinction.

    Arbitrating differences in relative ordering across different observer clocks is what N-temporal databases are about. In databases we usually call the basic 2-temporal case “bitemporal”. The trivial 1-temporal case (which is a quasi-global clock) is what we call “time-series”.

    The complexity is that N-temporality turns time into a true N-dimensional data type. These behave differently from the N-dimensional spatial data types everyone is familiar with, so you can’t use, e.g., quadtrees as you would in the 2-spatial case and expect them to perform well.

    There are no algorithms in literature for indexing N-temporal types at scale. It is a known open problem. That’s why we don’t do it in databases except at trivial scales where you can just brute-force the problem. (The theory problem is really interesting but once you start poking at it you quickly see why no one has made any progress on it. It hurts the brain just to think about it.)
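
    To make the 2-temporal (bitemporal) case concrete at trivial scale, here is a brute-force sketch (the names and layout are mine, not any particular database’s): every row carries a valid-time interval and a transaction-time interval, and an "as of" query just scans them all:

        package main

        import (
            "fmt"
            "time"
        )

        // A bitemporal version row: valid time is when the fact holds in the
        // real world; transaction time is when the database recorded it.
        type version struct {
            value        string
            validFrom    time.Time // fact holds from here...
            validTo      time.Time // ...until here (exclusive)
            recordedAt   time.Time // when this version was written
            supersededAt time.Time // when a correction replaced it (zero = current)
        }

        // asOf answers "what did we believe at txTime about the world at
        // validTime?" by linear scan. Fine at trivial scale, which is the
        // point above: nobody knows how to index this efficiently at scale.
        func asOf(history []version, validTime, txTime time.Time) (string, bool) {
            for _, v := range history {
                knownThen := !v.recordedAt.After(txTime) &&
                    (v.supersededAt.IsZero() || v.supersededAt.After(txTime))
                validThen := !v.validFrom.After(validTime) && v.validTo.After(validTime)
                if knownThen && validThen {
                    return v.value, true
                }
            }
            return "", false
        }

        func main() {
            d := func(s string) time.Time { t, _ := time.Parse("2006-01-02", s); return t }
            history := []version{
                // Recorded Jan 1: salary is 100 for all of 2023.
                {"salary=100", d("2023-01-01"), d("2024-01-01"), d("2023-01-01"), d("2023-06-01")},
                // Retroactive correction recorded Jun 1: it was 120 all along.
                {"salary=120", d("2023-01-01"), d("2024-01-01"), d("2023-06-01"), time.Time{}},
            }
            fmt.Println(asOf(history, d("2023-03-15"), d("2023-03-15"))) // salary=100 true
            fmt.Println(asOf(history, d("2023-03-15"), d("2023-07-01"))) // salary=120 true
        }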

  • Till the financial controller shows up at the very least.

    Also, even if it’s not required, it makes reasoning about how systems work a hell of a lot easier. So for the vast majority that doesn’t need massive throughput, sacrificing some speed for an easier-to-understand consistency model is a worthy tradeoff.

    • Financial systems, of all things, don’t care about time.

      Pretty much all financial transactions are settled with a given date, not instantly. Go sell some stocks: it takes 2 days to actually settle. (This may be hidden by your provider, but that’s how it works.)

      For that matter, the ultimate in BASE for financial transactions is the humble check.

      That is a great example of "money out" that will only be settled at some time in the future.

      There is a reason there is this notion of a "business day" and re-processing transactions that arrived out of order.

  • That's why we use UUIDv7 primary keys. Relativity be damned: our replication strategy does not depend on the timestamp component.
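
    For reference, a minimal sketch of what that buys you, following the UUIDv7 layout in RFC 9562 (an illustration, not a production generator): the top 48 bits are a Unix millisecond timestamp, so keys sort roughly by creation time byte-for-byte, but nothing about correctness hinges on clocks agreeing:

        package main

        import (
            "crypto/rand"
            "encoding/binary"
            "fmt"
            "time"
        )

        // newUUIDv7 follows the RFC 9562 layout: 48 bits of big-endian Unix
        // millisecond timestamp, then version and variant bits, then random
        // bits. Byte order equals rough time order, so B-tree inserts stay
        // mostly append-like, but no correctness depends on clock agreement.
        func newUUIDv7() ([16]byte, error) {
            var u [16]byte
            ms := uint64(time.Now().UnixMilli())
            binary.BigEndian.PutUint64(u[:8], ms<<16) // top 6 bytes = timestamp
            if _, err := rand.Read(u[6:]); err != nil { // bytes 6..15 random
                return u, err
            }
            u[6] = 0x70 | (u[6] & 0x0F) // stamp version 7
            u[8] = 0x80 | (u[8] & 0x3F) // stamp RFC 9562 variant
            return u, nil
        }

        func main() {
            a, _ := newUUIDv7()
            time.Sleep(2 * time.Millisecond)
            b, _ := newUUIDv7()
            fmt.Printf("%x\n%x\n", a, b) // later key is greater byte-for-byte
        }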

Happens all the time (ignoring best practices because it’s convenient, or doing something different ‘just because’), literally everywhere, including normal society.

Frankly, it’s shocking anything works at all.