I tested four NVMe SSDs from four vendors – half lose FLUSH'd data on power loss (2022)

2 years ago (twitter.com)

We shipped a shader cache in the latest release of OBS and quickly had reports come in that the cached data was invalid. After investigating, we found the cache files were the correct size on disk but their contents were all zeros. On a journaled file system this seems like it should be impossible, so the current guess is that some users have SSDs that ignore flushes and experience data corruption on crash / power loss.

  • I think this is typical behaviour with ext4 on Linux, if the application doesn't do fsync/fdatasync to flush the data to disk.

    Depending on mount options, ext4fs does metadata journaling, ensuring the FS itself is not borked, but not data journaling, which would safeguard the file contents in the event of an unclean shutdown with pending writes in the caches.

    The same phenomenon is at play when people complain that their log files contain NUL bytes after a crash. The file system metadata has been updated for the size of the file to fit the appended write, but the data itself was not written out yet.
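
    For reference, the usual application-side mitigation is the write + fsync + rename dance, so the new contents are durable before the directory entry points at them. A minimal sketch in C (error handling abbreviated; the paths are placeholders):

        /* Crash-safe file update sketch: write a temp file, fsync it,
           rename over the target, then fsync the directory so the
           rename itself is durable. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int write_durably(const char *dir, const char *tmp,
                          const char *dst, const void *buf, size_t len)
        {
            int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) return -1;
            if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
            if (fsync(fd) != 0) { close(fd); return -1; }  /* push data + metadata to the device */
            close(fd);

            if (rename(tmp, dst) != 0) return -1;

            int dfd = open(dir, O_RDONLY);                 /* make the rename itself durable */
            if (dfd < 0) return -1;
            int rc = fsync(dfd);
            close(dfd);
            return rc;
        }

    Even this only works if the drive honors the flush it is sent; if it lies about FLUSH, as the linked thread suggests some do, no amount of fsync saves you.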

    • The current default is data=ordered, which should prevent this problem if the hardware doesn't lie. The data doesn't go in the journal, but it has to be written before the journal is committed.

      There was a point where ext3 defaulted to data=writeback, which can definitely give you files full of null bytes.

      And data=journal exists but is overkill for this situation.

      18 replies →

    • I don't think that's how it works: Flushing metadata before data would be a security concern (consider e.g. the metadata change of increasing a file's length due to an append before the data change itself), so file systems usually only ever do the opposite, which is safe.

      Getting back zeroes after a metadata sync (which must follow a data sync) would accordingly be an indication of something weird having happened at the disk level: We'd expect to either see no data at all, or correct data, but not zeroes or any other file's or previously written stale data.

      10 replies →

  • I had this exact experience with my workstation SSD (NTFS) after a short power loss while NPM was running. After I turned the computer back on, several files (package.json, package-lock.json and many others inside node_modules) had the correct size on disk but were filled with zeros.

    I think the last time I had corrupted files after a power loss was in a FAT32 disk on Win98, but you'd usually get garbage data, not all zeros.

    • > but you'd usually get garbage data, not all zeros.

      You are less likely to get garbage with an SSD in combination with a modern filesystem because of TRIM. Even if the SSD has not (yet) wiped the data, it knows that a block that is marked as unused can be returned as a block of 0s without needing to check what is currently stored for that block.

      Traditional drives had no such facility to have blocks marked as unused from their PoV, so they always performed the read and returned what they found, which was most likely junk (old data from deleted files that would make sense in another context), though it could also be a block of zeros (because that block hadn't been used since the drive had a full format or someone zeroed free space).

    • They may be pointing to unallocated space, which on an SSD running TRIM would return all zeros. NTFS is an extremely resilient yet boring filesystem; I cannot remember the last time I had to run chkdsk, even after an improper shutdown.

      4 replies →

  • Journaling filesystems (including NTFS, and ext3/ext4 using default mount options) typically only track file structure metadata in the journal, so that is WAI - the filesystem structure was not corrupted, but all bets are off when it comes to the contents of the files.

  • I lost Audacity projects due to BSODs on a Surface Book several times in ~2019: the *_data/**.au files were intact, each containing just a few seconds of audio, but the .aup XML file that maps them and contains whatever else makes up the project was all zeroed. My memory’s fuzzy, but I think exiting sometimes triggered the BSOD, and save-on-exit corrupted the project consistently when it did, so the workaround was to remember to save first; then, if it BSODed, you were OK.

  • >experience data corruption on crash / power loss

    You mean on complete system crash, right? Your application crashing shouldn't lead to files being full of zeroes as long as you've already written everything out.

Misleading headline, since after testing eight more drives, none of the additional ones failed.

2/12 is not nearly as dramatic as “half”, and the ones that lost data are the cheap brands as one would expect.

  • You can either not editorialize the title, and accept that the thread contains updates, or editorialize it and violate HN guidelines.

    Either choice will lead somebody to complain

  • > "... and the ones that lost data are the cheap brands as one would expect."

    What a sad world to live in, when one comes to expect cheap storage devices not to fulfill their intended function.

  • SK Hynix is a major brand and the P31 is a great midrange SSD... except for the fact that it seemingly doesn't care about your data.

    • > SK Hynix is a major brand

      Is it? I passed on an offer for a drive carrying that name, and got something else for slightly more, the other day as I didn't know the name.

      Perhaps their noteworthiness varies internationally? Or do they mainly sell to manufacturers rather than direct to the likes of me?

      1 reply →

    • I have a Sabrent M2 in my own PC, bought it because it was the cheapest option. Incidentally I suspect it's the cause of system-wide slowdown in the past few months, even opening the file explorer takes over ten seconds sometimes.

  • To me the real thing missing is whether those drives advertise power loss protection or not. The next question is whether they are to be used in a laptop, where power loss protection is less relevant given the local battery.

    • That should be irrelevant, because flush is flush right? If your SSD does not write the data after a flush it's violating basic hard drive functionality.

      1 reply →

There is a flood of fake SSDs currently, mostly of big brands. I recently purchased a counterfeit 1TB drive. It passes all the tests, performance is ok, it works... except it has episodes where ioping reports anything between 0.7 ms and 15 seconds, and that is under zero load. And these are quality fakes from a physical appearance perspective. The only way I could tell mine was fake is that the official Kingston firmware update tool would not recognize the drive.

  • Where are you seeing counterfeits? AliExpress, Ebay, Amazon?

    • Probably Chinese sellers on all those sites. I've noticed a common thread with people who complain about counterfeits: they're literally buying alphabet-soup-brand fakes from Chinese FBA sellers instead of buying products directly sold by Amazon or from more traditional retail channels.

      5 replies →

  • Did you get the fake in an official box? Or OEM version? This is quite a big claim.

    • It doesn't strike me as a big claim; I bought some RAM for a NUC a few weeks ago on Amazon, only to determine that it was likely counterfeit. It came in an official box with all packaging intact.

      2 replies →

  • That's interesting. I have a Samsung 990 Pro bought on Amazon and get the random lags. I've only noticed it in the terminal, so I figured something else might be the culprit. It has never gone to 15 seconds, but it can be around 1s.

    The Samsung Magician app on Windows reports it as "genuine" and it was able to apply two firmware updates. The only thing it complains about is that I should be using PCIE 4 instead of 3, but I can't do anything about that.

    • I have been able to fix these random lags by doing multiple full disk reads. The first one will take very long, because it will trigger these lags. Subsequent ones will be much better.

      The leading theory I have read is that maintenance/refreshing on the SSD is not done preventatively/correctly by the firmware, and you need to trigger it by accessing the data.

      1 reply →

  • If you dig into the vendor data stored in the drive firmware, fakes are easy to spot. Model numbers, vendor IDs, and serial numbers will be zeroed out or won't conform to manufacturer spec.

    I purchased a bunch of fake kingston SD cards in China that worked well enough for the price, but crapped out within a year of mild use. I didn’t lose data. It was as if one day they worked. Then one day they were fried.

  • That’s wild. Is this limited to specific distribution channels or can you get them from anywhere?

Under long term heavy duty, I've routinely seen cheap modern platter outperform cheap brand name NVME.

There's some cost cutting somewhere. The NVMEs can't seem to sustain throughput.

It's been pretty disappointing to move I/O bound workloads over and not see notable improvements. The magnitude of data I'm talking about is 500-~3000GB

I've only got two NVME machines for what I'm doing so I'll gladly accept that it's coincidentally flaky bus hardware on two machines, but I haven't been impressed except for the first few seconds.

I know everyone says otherwise, which is why I brought it up. Someone tell me why I'm crazy.

Edit: no, I'm not crazy. https://htwingnut.com/2022/03/06/review-leven-2tb-2-5-sata-s... This is similar to what I'm seeing with Crucial and Adata hardware: almost binary performance.

  • For write loads this is expected, even for good drives, at some level. They tend to have some faster storage which takes your writes and the controller later pushes the changes to the main body of the drive. If you write in bulk the main, slower, portion can't keep up so the faster cache fills and your write has to wait and will perform as per the slowest part of the drive. Furthermore: good drives tend to have an amount of even faster DRAM cache too, so you'll see two drop-offs in performance during bulk write operations. For mainly read based loads any proper SSD¹ will outperform a traditional drive, but if your use case involves a lot of writing³ you need to make more careful choices⁵ to get good performance.

    I can't say I've ever seen a recent SSD (that isn't otherwise faulty) get slow enough to say it is outperformed by a traditional drive, even just counting the fastest end of the disk, but I've certainly seen them drop to around the same speed during a bulk write.

    --

    [1] unlike this sort of thing: https://www.tomshardware.com/news/low-performance-external-m...

    [2] get SLC-only⁴ drives, not QLC-with-SLC-cache or just-QLC, and so forth

    [3] bulk data processing tasks such as video editing are where you'll feel this significantly, unless your number-crunching is also bottlenecked at the CPU/GPU

    [4] SLC-only is going to be very expensive for large drives; even high-grade enterprise drives tend to be MLC-with-SLC-cache. SLC>MLC>TLC>QLC…

    [5] this can be quite difficult in the “consumer” market because you'll sometimes find a later revision of the same drive having a completely different memory and/or controller arrangement despite the headline model name/number not changing at all – this is one reason why early reviews can be very misleading
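
    To see the cache-exhaustion drop-off described above for yourself, a rough approach (my own sketch, not a proper benchmark: assumes Linux, O_DIRECT to keep the page cache out of the picture, and "testfile" as a placeholder path on the drive under test) is to time each chunk of a long sequential write and watch throughput fall once the fast cache fills:

        /* Rough sequential-write timing sketch (Linux, O_DIRECT).
           Prints throughput per 1 GiB chunk; expect it to drop once
           the drive's SLC/DRAM cache is exhausted. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
            const size_t blk = 1 << 20;           /* 1 MiB per write */
            const size_t blks_per_chunk = 1024;   /* report every 1 GiB */
            void *buf;
            if (posix_memalign(&buf, 4096, blk)) return 1;
            memset(buf, 0xA5, blk);

            int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
            if (fd < 0) { perror("open"); return 1; }

            for (int chunk = 0; chunk < 64; chunk++) {   /* 64 GiB total */
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (size_t i = 0; i < blks_per_chunk; i++)
                    if (write(fd, buf, blk) != (ssize_t)blk) { perror("write"); return 1; }
                clock_gettime(CLOCK_MONOTONIC, &t1);
                double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
                printf("chunk %2d: %.0f MiB/s\n", chunk, blks_per_chunk / s);
            }
            close(fd);
            return 0;
        }

    On a QLC-with-SLC-cache drive the first few chunks typically look great and then throughput collapses, which matches the "almost binary performance" described elsewhere in the thread.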

  • I think cheaper QLC chips use a part of their storage space as SLC, which is fast to write. But once you’ve written more than fits in the SLC cache, write throughput quickly tanks as the drive has to push the data on to the slower QLC parts.

    • Yeah I guess it works well for how most people use computers which is not actually for computation...

      Modern platter is actually pretty decent and cheap. It's probably still the way to go for large loads unless you have a grove of money trees

  • I used to use an HP EX920 for my system drive and it was abysmally slow at syncs. I'd open Signal and the computer would grind to a halt while it loaded messages from group chats. After much debugging, I found out Signal was saving each message to sqlite in a transaction causing lots of syncing.

    I found some bash script that looped and wrote small blocks synchronously; the HP EX920 managed something like 20 syncs/sec while my WD RE4 spinner was around 150. Other SSDs were much faster (it was a few years ago, so I can't remember the exact numbers).
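
    For anyone who wants to reproduce that kind of measurement, a rough equivalent of such a script in C (my sketch, not the original; "synctest.dat" is just a placeholder file on the drive being tested) counts how many small fdatasync'd writes per second the drive sustains:

        /* Sync-rate microbenchmark sketch: small appends, each followed
           by fdatasync(), counted over roughly five seconds. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
            char block[4096] = {0};
            int fd = open("synctest.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) { perror("open"); return 1; }

            struct timespec start, now;
            clock_gettime(CLOCK_MONOTONIC, &start);
            long syncs = 0;
            do {
                if (write(fd, block, sizeof block) != (ssize_t)sizeof block) { perror("write"); return 1; }
                if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
                syncs++;
                clock_gettime(CLOCK_MONOTONIC, &now);
            } while (now.tv_sec - start.tv_sec < 5);

            printf("~%ld synced writes/sec\n", syncs / 5);
            close(fd);
            return 0;
        }

    Numbers in the low tens per second, like the EX920 above, are a red flag for anything that syncs a lot (SQLite, package managers, mail servers).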

  • 1) Nobody says otherwise about cheap anything NVMe. They're pretty terrible once they've exhausted the write cache. This is well-known and addressed in every decent review by reputable sites.

    2) Sustaining throughput seems the least of our problems when some unknown number of NVMe SSDs might be literally losing flushed data.

  • >Under long term heavy duty, I've routinely seen cheap modern platter outperform cheap brand name NVME.

    Saw this happen at a previous job. I upgraded several Windows devices to Windows 10, and the fastest PC was a Dell desktop with an HDD.

    The others were midrange to lower-mid laptops coupled with low-end SSDs.

Writes are completed to the host when they land on the SSD controller, not when written to Flash. The SSD controller has to accumulate enough data to fill its write unit to Flash (the absolute minimum would be a Flash page, typically 16kB). If it waited for the write to Flash to send a completion, the latency would be unbearable. If it wrote every write to Flash as quickly as possible, it could waste much of the drive's capacity padding Flash pages. If a host tried to flush after every write to force the latter behavior, it would end up with the same problem.

Non-consumer drives solve the problem with back-up capacitance. Consumer drives do not have this. Also, if the author repeated this test 10 or 100 times on each drive, I suspect that he would uncover a failure rate for each consumer drive. It's a game of chance.

  • The whole point of explicit flush is to tell the drive that you want the write at the expense of performance. Either the drive should not accept the flush command or it should fulfill it, not lie.

    (BTW this points out the crappy use of the word “performance” in computing to mean nothing but “speed”. The machine should “perform” what the user requests — if you hired someone to do a task and they didn’t do it, we’d say they failed to perform. That’s what’s going on here.)

    • The more dire problem is the case where the drive runs out of physical capacity before logical capacity. If the host flushes data that is smaller than the physical write unit of the SSD, capacity is lost to padding (if the SSD honors every Flush). A "reasonable" amount of Flush would not make too much of a difference, but a pathological case like flush-after-every-4k would cause the SSD to run out of space prematurely. There should be a better interface to handle all this, but the IO stack would need to be modified to solve what amounts to a cost issue at the SSD level. It's a race to the bottom selling 1TB consumer SSDs for less than $100.

      11 replies →

  • This is the whole point of a FLUSH though. You expect latency penalties and worse performance (and extra pages) if you flush, but that's the expected behaviour: not for it to (apparently) completely disregard the command while pretending like it's done it.

  • > Non-consumer drives solve the problem with back-up capacitance.

    I’m pretty sure they used to be on consumer drives too. Then they got removed and all the review sites gave the manufacturer a free pass even though they’re selling products that are inadequate.

    Disks have one job, save data. If they can’t do that reliably they’re defective IMO.

  • > If a host tried to flush after every write to force the latter behavior, it would end up with the same problem.

    So? No reason to break the contract that flush makes all submitted writes durable. The drive can compact space in the background.

    • Yes, GC should be smart enough to free up space from padding. But then there's a write amplification penalty and meeting endurance specifications is impossible. A padded write already carries a write amplification >1, then GC needs to be invoked much more frequently on top of that to drive it even higher. With pathological Flush usage, you have to pick your poison. Run out of space, run out of SSD life.

Twitter yuk, can somebody just post the names of the four tested drives and which passed/failed please?

Does advertising a product as adhering to some standard, but secretly knowing that it doesn't 100%, count as e.g. fraud? I.e., is there any established case law on the matter?

I'm thinking of this example, but also more generally USB devices, Bluetooth devices, etc.

  • > there any established case law on the matter

    Always makes me laugh.

    Anyways, not in the US, which is probably where you're asking about, but yes, the vast majority of the developed world has that. It's called "false advertising", and exists at least in the EU, Australia, and the UK. You can't put a label on your product or advert that is false or misleading.

    So if the box says this is a WiFi6E router, but it's actually only WiFi 5 because it's using the wrong components to save on costs, you can report them to the relevant authority and they'll be fined (and depending on the case and scenario you get compensation). The process is harder, bordering on impossible, if you bought from AliExpress from a random no-name vendor, but as long as the vendor, platform, or store exists in a country with sensible regulation you can report it.

    • That’s not really what the commenter was asking. That’d be false advertising in the US too.

      I think the question is less “if they skimp on parts and lie” and more along the lines of incompleteness. Like “it’s an HTTP server, but they saved on effort and implement PUT as POST, which works fine for most use cases”.

      That said, I’d guess this would be a pretty hard case to win. The law typically requires intent for false advertising, so if they didn’t know they didn’t follow the spec they might be fine. And it depends on the claims and what the consumer can expect. If you deliberately don’t explain the exact spec your SSD complies with, and you make no explicit promises of compatibility, it’s a harder win. I bet few SSD manufacturers will say “Serial ATA v3.5 (May 2023), tested and compatible with OpenXFS commit XYZ on Debian Linux running kernel version 4.3.2”. But if they just say “super fast SSD with a physical SATA cable socket”, then what exactly was false if it doesn’t support the full spec?

  • I was under the impression that a lot of off-brand USB devices didn't use the USB logo specifically to get around certification requirements. Basically, they just aren't advertising adherence to a standard. No idea about NVMe or BT.

  • Not a lawyer, but I doubt it – otherwise you might have a case against Intel and AMD regarding Spectre and Meltdown?

    It might be a different story if the spec was intentionally violated, though (rather than incidentally, i.e. due to an idea that should have been transparent/indistinguishable externally but didn't work out).

    • "Oops we didn't mean to do that" isn't a defense from liability for product not doing what you told the purchaser it would.

      It's their responsibility to develop the product correctly, do QA, and if a defect is found, advise customers or stop selling the defective goods.

      The greatest scam the computer industry pulled was convincing people that computers are magical, unpredictable devices that are too complex for the industry to be held responsible for things not working as claimed.

      1 reply →

  • Merchantability and implied fitness? You absolutely could try suing them in small claims court for damages.

    For extra fun: if the box carries a trademark from a standards group, you could try adding them into the suit; use of their trademarked logo could be argued to be implied fitness, if there are standards the drive is supposed to meet to use it.

    At the very least they might get tired of the expense of sending someone to defend the claim, and it would cease to be profitable to engage in this scammery.

    • I don't think it's even implied fitness. Declaring you support SCSI commands is probably a direct advertisement of conformance.

  • I would probably use stronger words than that; data persistence is a big deal, so the missing part of the spec is a fundamental flaw. What's a disk whose persistence is random? You can probably legally assail the substance of the product.

  • For IT products, I doubt it. For sectors where regulation is more mature of course: take food, automotive, etc.

  • I wouldn't say fraud but this issue should trigger a recall.

    • I think it's more or less the same thing: the recall is the way to legally prove you didn't intend to disseminate the flawed product, whereas leaving it on the market after learning of the problem shows intent to keep it there. I would be surprised if discovery at those companies would not surface an email from engineers discussing this problem.

This is (2022).

Wondering if anything changed since the original tests...

  • > Wondering if anything changed since the original tests...

    You're wondering if firmware writers lie to layers higher up in the stack? I think it's a 100% certainty that there's drive firmware that lies.

    There's a reason why many vendors have compatibility lists, approved firmware versions, and even their "own" (rebranded from an OEM) drives that you have to buy if you want official support (and it's not entirely a money grab: a QA testing infrastructure does cost money).

    • I'm curious whether any of the brands which failed this test owned up to the issue and released firmware updates.

Meanwhile I'm over here jamming Micron 7450 pros into my work laptop for better sync write performance.

I have very little trust in consumer flash these days after seeing the firmware shortcuts and stealth hardware replacements manufacturers resort to in order to cut costs.

  • Have a solid vendor for these that isn't insanely priced (for home use)? The last couple of times I tried to buy one, they sent 7300s and tried to buy me off with a small refund (eBay).

Losing flushes is obviously bad.

I wonder how much perf is on the table in various scenarios if we can give up needing to flush. If you know the drive has some resilience, say 0.5s during which it can safely finish writing back, maybe you can give up flushes (in some cases). How much faster is the app then?

It'd be neat to see some low-cost improvements here. Obviously in most cases, just get an enterprise drive with supercaps or batteries onboard. But an ATX power rail that has extra resilience from the supply, or an add-in/pass-through 6-pin SATA power supercap... that could be useful too.

  • If the write-cache is reordering requests (and it does, that's the whole point), you can't guarantee that $milliseconds will be enough unless you stop all requests, wait $milliseconds, write your commit record, wait $milliseconds, then resume requests. This is essentially re-implementing write-barriers in an ad-hoc, buggy way which requires stalling requests even longer.

    Flush+FUA requires the data to be stored to non-volatile media. Capacitor-backed RAM dumping to flash is non-volatile. When a drive knows it has enough capacitor-time to finish flushing all preceding writes from the cache, it can immediately say the flush was completed. This can all be handled on the device without the software having to make guesses at how long something has to be written before it's durable.

  • Performance gains wouldn’t be that large as enterprise SSDs already have internal capacitors to flush pending writes to NAND.

    During typical usage the flash controller is constantly journaling LBA to physical addresses in the background, so that the entire logical to physical table isn’t lost when the drive loses power. With a larger capacitor you could potentially remove this background process and instead flush the entire logical to physical table when the drive registers power loss. But as this area makes up ~2% of the total NAND, that’s at absolute best a 2% performance benefit we are potentially missing out on.

    • You could gain much more by coalescing repeated writes to the same address - database scenarios for example

I guess it's time for `fsync_but_really_actually_sync_it_please(2)` (and the lower level equivalents in SATA, NVMe etc.)?

  • > (and the lower level equivalents in SATA, NVMe etc.)?

    This is not a technical problem that needs yet another SATA/SAS/etc command to be standardized. It's a 'social' problem that there's no real incentives for firmware writers to tell the truth 100% of the time.

    The best you can hope for is if you buy a fancy-pants enterprise storage solution with compatibility lists and approved firmware versions.

Flushing in this case is from the SSDs internal DRAM cache to the actual NAND flash?

  • It’s the computer telling the drive “write everything to durable storage (as opposed to some kind of in-drive cache/RAM) and tell me when it’s done”.

    After that command it should be 100% safe to pull the power because everything SHOULD have been written to flash. That’s the point of the command.

    It’s interesting that the drives that do it wrong still take time indicating they’re doing something.

  • The DRAM cache does not hold user data. It holds the flash translation layer that links LBAs to NAND pages. Higher performance drives use 1GB of DRAM per 1TB of NAND. In cheap DRAM-less drives, if the I/O to be serviced is not cached in the 1MB or so of SRAM, it has to do a double lookup: once to retrieve the relevant part of the FTL table from NAND, and a second lookup to actually service the I/O.

It'd be nice if there were a database of known-bad/known-good hardware to reference. I know there have been some spreadsheets and special-purpose efforts, like the USB-C cables Benson Leung tested.

Especially for consumer hardware on Linux--there's a lot of stuff that "works" but is not necessarily stable long term or that required a lot of hacking on the kernel side to work around issues

Well, yes, but which were those 2 out of 4 vendors?

The model I’d be interested in would be the SK Hynix/Solidigm P44 Pro, as it competes with the Samsung 9xx Evo and Pro models.

I am a bit annoyed that everyone here takes this at face value. There's zero evidence given; not even the vendors and models are named, so none of this can be confirmed.

On a related note, I tested 4 DDR5 RAM kits from major vendors - half of them corrupt data when exposed to UV light.

This has always been the case? At least it was something we learned in a course when we wrote our own device drivers for Minix; even the controllers on spinning metal fib about flushes.

At this point, any storage vendor should be required to pass the SQLite test suite before they can sell their product.

Also…would modern journaling file systems protect against this sort of data loss?

If you need PLP use an enterprise drive. That's what they're for.

Cheap drives don't include large dram caches, lack fast SLC areas, and leave off super-capacitors that allow chips to drain buffers during a power-failure.

"Buy cheap, buy twice" as they say... =)

Without any more information this post is just bullshit. For example, it's not documented how the flushing has been done. On Linux, even issuing 'sync' is not enough: https://unix.stackexchange.com/questions/98568/difference-be...

The bottom answer especially states that "blockdev --flushbufs may still be required if there is a large write cache and you're disconnecting the device immediately after"

The hdparm utility has a parameter for syncing and flushing the device's own buffers. It seems like all three should be done for a complete flush at all levels.
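
For what it's worth, a rough sketch of those layers from C on Linux (my assumptions: /dev/nvme0n1 as a placeholder device, root privileges; the ATA-level flush that hdparm -F issues is not shown) would be sync() for the dirty page cache, fsync() on the block device node, which asks the kernel to send a cache flush to the drive, and the BLKFLSBUF ioctl that blockdev --flushbufs uses:

    /* Flush at several levels: global page cache, the block device's
       buffers plus a device cache flush, then BLKFLSBUF as
       blockdev --flushbufs would issue. Needs root. */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        sync();                                   /* flush dirty pages system-wide */

        int fd = open("/dev/nvme0n1", O_RDONLY);  /* placeholder device path */
        if (fd < 0) { perror("open"); return 1; }

        if (fsync(fd) != 0)                       /* block device flush incl. drive cache */
            perror("fsync");
        if (ioctl(fd, BLKFLSBUF, 0) != 0)         /* flush/invalidate the buffer cache */
            perror("BLKFLSBUF");

        close(fd);
        return 0;
    }

Of course, none of this helps if the drive acknowledges the flush and then drops the data anyway, which is exactly what the thread claims.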

Don't use home-grade SSDs for storing anything that is considered critical.

The rule is not that hard to remember.

Name the offenders please.

I suspect it might be easy to spot visually - the lack of a substantial capacitor on the board would indicate a high likelihood of data loss.

That is unfortunate, but I guess those SSDs performed really well and outclassed all others in performance benchmarks? lol