Comment by cornholio
5 days ago
You know you need to be careful when an Amazon engineer argues for a database architecture that fully leverages (and makes you dependent on) the strengths of their employer's product. In particular:
> Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).
This is surely true for certain use cases, say financial applications that must guarantee 100% uptime, but I'd argue the vast, vast majority of applications are perfectly OK with local commit and rapid recovery from remote logs and replicas. The point is, the cloud won't give you that distributed consistency for free: you will pay for it in both money and complexity, which in practice will lock you in to a specific cloud vendor.
I.e., make cloud and hosting services impossible for database vendors to commoditize, which is exactly the point.
Skipping flushing the local disk seems rather silly to me:
- A modern high-end SSD commits faster than the one-way time to anywhere much farther than a few miles away. (Do the math; a rough sketch follows this list. A few tens of microseconds of specified write latency is pretty common, and NVDIMMs (a sadly dying technology) can do even better. The speed of light is only so fast.)
- Unfortunate local correlated failures happen. IMO it’s quite nice to be able to boot up your machine / rack / datacenter and have your data there.
- Not everyone runs something on the scale of S3 or EBS. Those systems are awesome, but they are (a) exceedingly complex and (b) really very slow compared to SSDs. If I’m going to run an active/standby or active/active system with, say, two locations, I will flush to disk in both locations.
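Rough numbers for the "do the math" bit in the first point, sketched below; the SSD latency and fiber speed are illustrative assumptions, not measurements:

    # One-way propagation delay in fiber vs. a local SSD sync write.
    # Both numbers are illustrative assumptions, not measurements.
    ssd_write_latency_us = 20            # tens of microseconds is a common spec figure
    fiber_speed_km_per_s = 200_000       # roughly 2/3 of c in optical fiber

    def one_way_delay_us(distance_km):
        return distance_km / fiber_speed_km_per_s * 1_000_000

    for km in (1, 10, 100, 1000):
        print(f"{km:>4} km: {one_way_delay_us(km):7.0f} us one-way vs {ssd_write_latency_us} us local SSD commit")

Around 10 km, the one-way hop alone already exceeds the local commit time, and it only grows from there.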
> Skipping flushing the local disk seems rather silly to me
It is. Coordinated failures shouldn't be a surprise these days. It's kind of sad to hear that from an AWS engineer. The same data pattern fills the buffers and crashes multiple servers while they were all "hoping" that the others had fsynced the data, but it turns out they all filled up and crashed together. That's just one case; there are others.
Durability always has an asterisk, i.e. it is guaranteed only up to some number N of devices failing. Once that N is set, your durability is gone the moment those N devices all fail together, whether that N counts local disks or remote servers.
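A toy version of that asterisk, with made-up numbers just to show the shape of the problem:

    # Chance of losing every replica in some window, assuming a made-up
    # per-replica failure probability p. Illustrative only.
    p = 0.01          # assumed probability one replica dies in the window
    replicas = 3

    p_independent = p ** replicas   # replicas fail for unrelated reasons
    p_correlated = p                # one shared trigger (same bug, same surge) kills them all

    print(f"independent: {p_independent:.0e}")   # ~1e-06
    print(f"correlated:  {p_correlated:.0e}")    # ~1e-02

The N only protects you if the failures are actually independent, which is exactly what a correlated buffer-fill bug takes away.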
4 replies →
This is an aside, but has anyone tried NVDIMMs as the disk, behind in-package HBM for RAM? No idea if it would be any good, just kind of a funny thought. It's like everything shifted one slot closer to the cores, haha: nonvolatile memory where the RAM used to live, memory pretty close to the core.
I think this entire design approach is on its way out. It turns out that the DIMM protocol was very much designed for volatile RAM, and shoehorning anything else in involves changes through a bunch of the stack (CPU, memory controller, DIMMs), which are largely proprietary and were never intended to support complex devices from other vendors according to published standards. Sure, every CPU and motherboard attempts to work with every appropriately specced DIMM, but that doesn’t mean that the same “physical” bits land in the same DRAM cells if you change your motherboard. Beyond interoperability issues, the entire cache system on most CPUs was always built on the assumption that, if the power fails, the contents of memory do not need to retain any well-defined values. Intel had several false starts trying to build a reliable mechanism to flush writes all the way to the DIMM.
Instead the industry seems to be moving toward CXL for fancy memory-like-but-not-quite-memory. CXL is based on PCIe, and it doesn’t have these weird interoperability and caching issues. Flushing writes all the way to PCIe has never been much of a problem, since basically every PCI or PCIe device ever requires a mechanism by which host software can communicate all the way to the device without the IO getting stalled in some buffer on the way.
I think it is fair to argue that there is a strong correlation between the criticality of data and the scale of the network. Most small businesses don't need anything at S3 scale, but they also don't need 24/7 uptime, and losing the most recent day of data is annoying rather than catastrophic, so they can probably get away without flushing, relying instead on daily asynchronous backups to a different machine and a one-minute UPS so data can be written out safely in the event of a power outage.
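Something in this spirit is usually enough at that tier; a minimal sketch, assuming rsync is available, with placeholder paths and hostnames:

    # Nightly asynchronous backup sketch: mirror the data directory to another machine.
    import subprocess

    SRC = "/var/lib/myapp/data/"            # placeholder: local data directory
    DST = "backup-host:/backups/myapp/"     # placeholder: remote machine and path

    # -a preserves metadata, --delete mirrors deletions; schedule this once a day (e.g. from cron).
    subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=True)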
Committing to an NVMe drive properly is really costly. I'm talking about using O_DIRECT | O_SYNC or fsync here. It can easily be on the order of whole milliseconds, and it is much worse if you are using cloud systems.
It is actually very cheap if done right. Enterprise SSDs have write-through caches, so an O_DIRECT|O_DSYNC write is sufficient, if you set things up so the filesystem doesn't have to also commit its own logs.
I just tested the mediocre enterprise NVMe I have sitting on my desk (Micron 7400 Pro): it does over 30,000 fsyncs per second (over a Thunderbolt adapter to my laptop, even).
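For anyone who wants to reproduce that kind of number, here is a minimal sketch of such a test (not the parent's actual benchmark; the filename is a placeholder, and results depend entirely on the drive):

    # Minimal fsync microbenchmark: rewrite one 4 KiB block and fsync each time.
    import os, time

    fd = os.open("fsync_test.bin", os.O_WRONLY | os.O_CREAT, 0o644)   # placeholder path
    buf = b"x" * 4096
    n = 10_000

    start = time.perf_counter()
    for _ in range(n):
        os.pwrite(fd, buf, 0)   # rewrite the same 4 KiB block
        os.fsync(fd)            # force it to stable storage
    elapsed = time.perf_counter() - start
    os.close(fd)

    # (An O_DIRECT | O_DSYNC write path avoids the separate fsync, but needs aligned buffers.)
    print(f"{n / elapsed:,.0f} fsyncs/s, {elapsed / n * 1e6:.1f} us each")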
5 replies →
Isn't that why a WAL exists, so you don't actually need to do that with e.g. Postgres and other RDBMSes?
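(For context on what the WAL buys: it turns scattered page writes into one sequential append, and group commit shares a single fsync across many transactions. A toy sketch of the idea, not Postgres's actual implementation:)

    # Toy group-commit: many records, one sequential append, one fsync.
    import os

    class Wal:
        def __init__(self, path="wal.log"):   # placeholder log path
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
            self.pending = []

        def append(self, record: bytes):
            self.pending.append(record)        # buffered, not yet durable

        def commit(self):
            os.write(self.fd, b"".join(self.pending))   # one sequential write
            os.fsync(self.fd)                           # one flush for the whole batch
            self.pending.clear()

    wal = Wal()
    for i in range(100):
        wal.append(f"txn {i}\n".encode())
    wal.commit()   # 100 records amortized over a single fsync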
1 reply →
Not just any old Amazon engineer. He's been with Amazon since at least 2008, and he's from Cape Town.
It's very likely that he was part of the team that invented EC2.
Yes, my first thought here was "how to build a database that locks you into the cloud" vs. "for SSDs".