Comment by kmxdm
2 years ago
Writes are completed to the host when they land on the SSD controller, not when written to Flash. The SSD controller has to accumulate enough data to fill its write unit to Flash (the absolute minimum would be a Flash page, typically 16kB). If it waited for the write to Flash to send a completion, the latency would be unbearable. If it wrote every write to Flash as quickly as possible, it could waste much of the drive's capacity padding Flash pages. If a host tried to flush after every write to force the latter behavior, it would end up with the same problem. Non-consumer drives solve the problem with back-up capacitance. Consumer drives do not have this. Also, if the author repeated this test 10 or 100 times on each drive, I suspect that he would uncover a failure rate for each consumer drive. It's a game of chance.
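To make the host side of that contract concrete, here's a minimal POSIX C sketch (the filename and buffer size are made up for illustration): write() returning success does not imply durability; fsync() is the explicit request that the kernel turns into a device FLUSH.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Hypothetical test file; any writable path works. */
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        memset(buf, 'x', sizeof buf);

        /* write() completing only means the kernel accepted the data;
           on the device side, completion likewise means the data
           reached the controller, not the Flash itself. */
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
            perror("write");
            return 1;
        }

        /* fsync() is the explicit durability request: the kernel writes
           back dirty pages and issues a FLUSH to the drive. If fsync()
           returns 0 but the drive quietly ignored the FLUSH, power loss
           can still eat the data - that's the lie being described here. */
        if (fsync(fd) != 0) {
            perror("fsync");
            return 1;
        }

        close(fd);
        return 0;
    }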
The whole point of an explicit flush is to tell the drive that you want the write persisted, at the expense of performance. Either the drive should not accept the flush command or it should fulfill it, not lie.
(BTW this points out the crappy use of the word “performance” in computing to mean nothing but “speed”. The machine should “perform” what the user requests — if you hired someone to do a task and they didn’t do it, we’d say they failed to perform. That’s what’s going on here.)
The more dire problem is the case where the drive runs out of physical capacity before logical capacity. If the host flushes writes smaller than the physical write unit of the SSD, capacity is lost to padding (assuming the SSD honors every Flush). A "reasonable" amount of flushing would not make much of a difference, but a pathological pattern like flush-after-every-4k would cause the SSD to run out of space prematurely. There should be a better interface to handle all this, but the IO stack would need to be modified to solve what amounts to a cost issue at the SSD level. It's a race to the bottom selling 1TB consumer SSDs for less than $100.
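As a rough, illustrative sketch of that pathological case (assuming the 16kB page size mentioned above and a flush after every 4kB write), the padding cost works out like this:

    #include <stdio.h>

    int main(void) {
        /* Assumed numbers: 16 kB Flash page (the minimum write unit
           mentioned above), flush after every 4 kB write. */
        const double page_kb  = 16.0;
        const double write_kb = 4.0;

        /* If every flush forces a full page program, each page ends up
           holding 4 kB of data and 12 kB of padding. */
        double padding = (page_kb - write_kb) / page_kb;
        printf("capacity lost to padding: %.0f%%\n", padding * 100.0);

        /* Equivalently, write amplification from padding alone: */
        printf("write amplification (padding only): %.1fx\n",
               page_kb / write_kb);
        return 0;
    }

That's 75% of physical capacity gone to padding: under this pattern a 1TB drive would fill up after roughly 256GB of logical writes.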
I still don't think this is the problem; the drive can just slow down accepting writes until it has reclaimed enough space.
The bigger problem is manufacturers chasing performance numbers. Generally you get the feeling they just hit their firmware with a hammer until it just barely doesn't break NTFS.
See also the drama around btrfs' "unreliability", which all traces back to drives with broken firmware. I fully expect bcachefs will run into exactly the same problems.
If it can’t honor the flush, it should return an error.
This is the whole point of a FLUSH, though. You expect latency penalties, worse performance, and extra padded pages if you flush; that's the expected behaviour. What you don't expect is for it to (apparently) completely disregard the command while pretending it has done it.
> Non-consumer drives solve the problem with back-up capacitance.
I’m pretty sure backup capacitors used to be on consumer drives too. Then they got removed, and all the review sites gave the manufacturers a free pass even though they’re selling products that are inadequate.
Disks have one job: save data. If they can’t do that reliably, they’re defective IMO.
> If a host tried to flush after every write to force the latter behavior, it would end up with the same problem.
So? No reason to break the contract that flush makes all submitted writes durable. The drive can compact space in the background.
Yes, GC should be smart enough to free up space from padding. But then there's a write amplification penalty, and meeting endurance specifications becomes impossible. A padded write already carries a write amplification >1; then GC needs to be invoked much more frequently on top of that, driving it even higher. With pathological Flush usage, you have to pick your poison: run out of space or run out of SSD life.
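To put rough numbers on that trade-off (all assumed for illustration: the same 4kB-flush-into-16kB-page padding as above, a hypothetical GC factor of 2x, and a hypothetical 600 TBW endurance rating):

    #include <stdio.h>

    int main(void) {
        /* Illustrative, assumed numbers only. */
        double wa_padding = 16.0 / 4.0; /* 4 kB flushes into 16 kB pages */
        double wa_gc      = 2.0;        /* assumed: GC rewrites each valid
                                           page once on average */

        /* The two effects multiply: GC has to move the padded pages
           too, so Flash sees far more programmed bytes than the host
           actually wrote. */
        double wa_total = wa_padding * wa_gc;
        printf("total write amplification: %.1fx\n", wa_total);

        /* Against the hypothetical 600 TBW rating, the host can only
           write 600 / wa_total TB before the Flash wears out. */
        printf("effective host endurance: %.0f TB of a rated 600 TBW\n",
               600.0 / wa_total);
        return 0;
    }

Under those assumptions the host gets 75 TB of writes out of a drive rated for 600: an 8x haircut before anything has gone wrong.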