I tested four NVMe SSDs from four vendors – half lose FLUSH’d data on power loss

4 years ago (twitter.com)

I always liked the embedded system model where you get flash hardware that has two operations -- erase block and write block. GC, RAID, error correction, etc. are then handled at the application level. It was never clear to me that the current tradeoff with consumer-grade SSDs was right. On the one hand, things like error correction, redundancy, and garbage collection don't require attention from the CPU (and, more importantly, don't tie up any bus). On the other hand, the user has no control over what the software on the SSD's chip does. Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.

It would be nice if we could just buy dumb flash and let the application do whatever it wants (I guess that application would usually be your filesystem, but it could also be direct access for specialized use cases like databases). If you want maximum speed, adjust your settings for that. If you want maximum write durability, adjust your settings for that. People are always looking for that one-size-fits-all setting, but it's hard here. Some people may be running cloud providers and already have software to store a block on 3 different continents. Others may be building an embedded system with a fixed disk image that changes once a year, plus some temporary storage for logs. There probably isn't a single setting that gets optimal use out of the flash memory for both use cases. The cloud provider doesn't care if a block, flash chip, drive, server rack, availability zone, or continent goes away. The embedded system may be happy to lose logs in exchange for having enough writes left to install the next security update.

It's all a mess, but the constraints have changed since we made the mess. You used to be happy to get 1/6th of a PCI Express lane for all your storage. Now processors directly expose 128 PCIe lanes and have a multitude of underused efficiency cores waiting to be used. Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.

  • There are really two problems here:

    1. Contemporary mainstream OSes have not risen to the challenge of dealing appropriately with the multi-CPU, multi-address-space nature of modern computers. The proportion of the computer that the "OS" runs on has been shrinking for a long time, and there have only been a few efforts to try to fix that (e.g. HarmonyOS, nrk, RTKit)

    2. Hardware vendors, faced with proprietary or non-malleable OSes and incentives to keep as much magic in the firmware as possible, have moved forward by essentially sandboxing the user OS behind a compatibility shim. And because it works well enough, OS developers do not feel the need to adjust to the hardware, continuing the cycle.

    There is one notable recent exception in adjusting filesystems to SMR/Zoned devices. However this is only on Linux, so desktop PC component vendors do not care. (Quite the opposite: they disable the feature on desktop hardware for market segmentation)

    • Are there solutions to this in the high-performance computing space, where random access to massive datasets is frequent enough that the “sandboxing” overhead adds up?

      8 replies →

  • I can recommend the related talk "It's Time for Operating Systems to Rediscover Hardware". [1]

    It explores how modern systems are a set of cooperating devices (each with their own OS) while our main operating systems still pretend to be fully in charge.

    [1] https://www.youtube.com/watch?v=36myc8wQhLo

    • Fundamentally the job of the OS is resource sharing and scheduling. All the low level device management is just a side show.

      That's why SSDs use a block-layer abstraction (or, in the case of NVMe key/value, hello 1964/CKD) above whatever pile of physical flash, caches, non-volatile caps/batteries, etc. lies beneath. That abstraction holds from the lowliest SD card to huge NVMe-oF/FC/etc. smart arrays which are thin provisioning, deduplicating, replicating, snapshotting, etc. You wouldn't want this running on the main cores for performance and power-efficiency reasons. Modern M.2/SATA SSDs have a handful of CPUs managing all this complexity, along with background scrubbing, error correction, etc., so you would be talking fully heterogeneous OS kernels knowledgeable about multiple architectures, etc.

      Basically it would be insanity.

      SSDs took a couple orders of magnitude off the time values of spinning rust/arrays, but many of the optimization points of spinning rust still apply. Pack your buffers and submit large contiguous read/write accesses, queue a lot of commands in parallel, etc.

      So, the fundamental abstraction still holds true.

      And this is true for most other parts of the computer as well. Just talking to a keyboard involves multiple microcontrollers, scheduling the USB bus, packet switching, and serializing/deserializing the USB packets, etc. This is also why every modern CPU has a management CPU that bootstraps it and manages its power/voltage/clock/thermals.

      So, hardware abstractions are just as useful as software abstractions like huge process address spaces, file IO, etc.

    • Our modern systems have sort of achieved what microkernels set out to do: our storage and network each have their own OS.

    • And if the entire purpose of computer programming is to control and/or reduce complexity, I should think the discipline would be embarrassed with the direction in which the industries have been going the past several years. AWS alone should serve as an example.

      4 replies →

    • An interesting approach would be to standardize a way to program the controllers in flash disks, maybe something similar to OpenFirmware. Mainframes farm out all sort of IO to secondary processors and it was relatively common to overwrite the firmware in Commodore 1541 drives, replacing the basic IO routines with faster ones (or with copy protection shenanigans). I'm not sure anyone ever did that, but it should be possible to process data stored in files without tying up the host computer.

      3 replies →

  • Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology. But in the enterprise SSD space, there's a lot of experimentation with exactly this kind of thing. One of the most popular right now is zoned namespaces, which separates write and erase operations but otherwise still abstracts away most of the esoteric details that will vary between products and chip generations. That makes it a usable model for both flash and SMR hard drives. It doesn't completely preclude dishonest caching, but removes some of the incentive for it.
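
    To make the zoned model concrete: on Linux the same write-pointer/reset-zone semantics are exposed for zoned block devices through <linux/blkzoned.h>. A minimal sketch (hypothetical device path; assumes a host-managed zoned device):

        /* Hypothetical sketch: inspect zones on a Linux zoned block device
         * (a ZNS SSD namespace or host-managed SMR HDD). */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <linux/blkzoned.h>

        #define NZONES 4

        int main(void)
        {
            int fd = open("/dev/nvme0n2", O_RDONLY);   /* example: a zoned namespace */
            if (fd < 0) { perror("open"); return 1; }

            size_t sz = sizeof(struct blk_zone_report) + NZONES * sizeof(struct blk_zone);
            struct blk_zone_report *rep = calloc(1, sz);
            rep->sector = 0;                 /* start reporting at the first zone */
            rep->nr_zones = NZONES;

            if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

            for (unsigned i = 0; i < rep->nr_zones; i++)
                printf("zone %u: start=%llu len=%llu wp=%llu cond=%u\n", i,
                       (unsigned long long)rep->zones[i].start,
                       (unsigned long long)rep->zones[i].len,
                       (unsigned long long)rep->zones[i].wp,
                       (unsigned)rep->zones[i].cond);

            /* Writes must land at each zone's write pointer; space is reclaimed by
             * an explicit BLKRESETZONE on a whole zone -- the "erase" made visible. */
            return 0;
        }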

    • There is no strong reason why a consumer SSD can't allow reformatting to a smaller normal namespace and a separate zoned namespace. Zone-aware CoW file systems allow efficiently combining FS-level compaction/space-reclamation with NAND-level rewrites/write-leveling.

      I'd probably pay for "unlocking" ZNS on my Samsung 980 Pro, if just to reduce the write amplification.

      4 replies →

    • > Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology.

      From what I understand, the abstraction works a lot like virtual memory. The drive shows up as a virtual address space pretending to be a disk drive, and then the drive's firmware maps virtual addresses to physical ones.

      That doesn't seem at all incompatible with exposing the mappings to the OS through newer APIs so the OS can inspect or change the mappings instead of having the firmware do it.
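
      A toy sketch of that mapping idea (purely illustrative, nothing like real firmware): writes never overwrite in place, they go to a fresh physical page and the logical-to-physical table is updated, which is also why stale copies linger until garbage collection:

          /* Toy flash-translation-layer sketch: logical block -> physical page.
           * Purely illustrative; real firmware also tracks erase counts, ECC,
           * bad blocks, GC state, and much more. */
          #include <stdint.h>

          #define LBAS       1024
          #define PHYS_PAGES 4096
          #define UNMAPPED   UINT32_MAX

          static uint32_t l2p[LBAS];        /* the logical-to-physical map      */
          static uint32_t next_free = 0;    /* naive append-only page allocator */

          void ftl_init(void) {
              for (uint32_t i = 0; i < LBAS; i++) l2p[i] = UNMAPPED;
          }

          /* A "write" never overwrites in place: it takes a fresh physical page
           * and remaps; the old page becomes stale until GC erases its block. */
          uint32_t ftl_write(uint32_t lba) {
              uint32_t phys = next_free++ % PHYS_PAGES;  /* ignore GC/wrap here */
              l2p[lba] = phys;
              return phys;                     /* where the data actually went */
          }

          uint32_t ftl_read(uint32_t lba) {
              return l2p[lba];                 /* UNMAPPED if never written     */
          }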

      1 reply →

    • If anything, consumer-level SSDs are moving in the opposite direction. On the Samsung 980 Pro it is not even possible to change the sector size from 512 bytes to 4K.

  • It's called the program-erase model. Some flash devices do expose raw flash, although it's then usually used by a filesystem (I don't know if any apps use it natively).

    There are a _lot_ of problems with doing high-performance NAND management yourself. You honestly don't want to do that in your app. If vendors would provide full specs and characterization of the NAND and create software-suitable interfaces for the device then maybe it would be feasible to do in a library or kernel driver, but even then it's pretty thankless work.

    You almost certainly want to just buy a reliable device.

    Endurance management is very complicated. It's not just a matter of ensuring that the PE cycles for any given block will meet the UBER spec at the data-retention limit with the given ECC scheme. Well, it could be in a naive scheme, but then your costs go up.

    Even something as simple as error correction is not simple. Error correction is too slow to do on the host for most IOs, so you need hardware ECC engines on the controller. But those become very large if you give them a huge amount of correction capability, so if errors exceed their capability you might fall back to firmware. Either way, the error rate is still important for knowing the health of the data, so you would need error-rate data to be sent side-band with the data by the controller somehow. If you get a high error rate, does that mean the block is bad, or does it mean you chose the wrong Vt to issue the read with, the retention limit was approached, the page had read-disturb events, dwell time was suboptimal, or the operating temperature was too low? All these things might factor into your garbage collection and endurance management strategy.

    Oh and all these things depend on every NAND design/process from each NAND manufacturer.

    And then there's higher level redundancy than just per-cell (e.g., word line, chip, block, etc). Which all depend on the exact geometry of the NAND and how the controller wires them up.

    I think a better approach would be a higher-level logical program/free model that sits above the low-level UBER guarantees. GC would have to heed direction coming back from the device about which blocks must be freed and which blocks must be allocated next.

  • > Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.

    I don't know, maybe if there was a "my files exist after the power goes out" column on the website, then I'd sort by that, too?

    • Ultimately the problem is on the review side, probably because there's no money in it. There just aren't enough channels to sell that kind of content into, and it all seems relatively celebrity driven. That said, I bet there's room for a YouTube personality to produce weekly 10-minute videos where they torture drives old and new - and torture them properly, with real scientific/journalistic integrity. So basically you need to be an idealistic, outspoken nerd with a little money to burn on drives and an audio/video setup. Such a person would definitely have such a "column" included in their reviews!

      (And they could review memory, too, and do backgrounder videos about standards and commonly available parts.)

    • >I don't know, maybe if there was a "my files exist after the power goes out" column on the website

      more like, "don't lose the last 5 seconds of writes if the power goes out". If ordering is preserved you should keep your filesystem, just lose more writes than you expected.

      2 replies →

  • The flip side of the tyranny of the hardware flash controller is that the user can't reliably lose data even if they want to. Your super secure end-to-end messaging system that automatically erases older messages is probably leaving a whole bunch of copies of those "erased" messages lying around on the raw flash on the other side of the hardware flash controller. This can create a weird situation where it is literally impossible to reliably delete anything on certain platforms.

    There is sometimes a whole device erase function provided, but it turns out that a significant portion of tested devices don't actually manage to do that.

  • > Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.

    Yeah, this is how things are supposed to be done, and the fact it's not happening is a huge problem. Hardware makers isolate our operating systems in the exact same way operating systems isolate our processes. The operating system is not really in charge of anything; the hardware just gives it an illusory sandboxed machine to play around in, a machine that doesn't even reflect what the hardware truly looks like. The real computers are all hidden and programmed by proprietary firmware.

    https://youtu.be/36myc8wQhLo

    • Flash storage is complex in the extreme at the low level. The very fact we're talking about microcontroller flash as if it's even in the same ballpark as NVMe SSDs in terms of complexity or storage management says a lot on its own about how much people here understand the subject (including me).

      I haven't done research on flash design in almost a decade, back when I worked on backup software, and my conclusions back then were basically this: you're just better off buying a reliable drive that meets your own reliability/performance requirements, making tweaks to your application to match the underlying drive's operational behavior (coalesce writes, append as much as you can, take care with multithreading vs HDDs/SSDs, et cetera), and testing the hell out of that with a blessed software stack. So we also did extensive tests on which host filesystems, kernel versions, etc. seemed "valid" or "good". It wasn't easy.

      The amount of complexity needed to manage error correction and wear leveling on these devices alone, including auxiliary constraints, probably rivals the entire Linux I/O stack. And it's all incredibly vendor specific. An auxiliary case, e.g. the OP's case of handling power loss and flushing correctly, is vastly easier when you only have to consider some controller firmware and some capacitors on the drive, versus the whole OS being guaranteed to handle any given state the drive might be in, with adequate backup power, at the time of failure, for any vendor and any device class. You'll inevitably conclude the drive is the better place to do this job precisely because it eliminates a massive number of variables like this.

      "Oh, but what about error correction and all that? Wouldn't that be better handled by the OS?" I don't know. What do you think "error correction" means for a flash drive? Every PHY on your computer for almost every moderately high-speed interface has a built in error correction layer to account for introduced channel noise, in theory no different than "error correction" on SSDs in the large, but nobody here is like, "damn, I need every number on the USB PHY controller on my mobo so that I can handle the error correction myself in the host software", because that would be insane for most of the same reasons and nearly impossible to handle for every class of device. Many "Errors" are transients that are expected in normal operation, actually, aside from the extra fact you couldn't do ECC on the host CPU for most high speed interfaces. Good luck doing ECC across 8x NVMe drives when that has to go over the bus to the CPU to get anything done...

      You think you want this job. You do not want this job. And we all believe we could handle this job because all the complexity is hidden well enough and oiled by enough blood, sweat, and tears, to meet most reasonable use cases.

  • I would absolutely love to have access to "dumb" flash from my application logic. I've got append only systems where I could be writing to disk many times faster if the controller weren't trying to be clever in anticipation of block updates.

    • This is like the claim that if I optimize memcpy() for the number of controllers, levels of cache, and latency to each controller/cache, it's possible to make it faster than both the CPU microcoded version (rep stosq/etc.) and the software versions provided by the compiler/glibc/kernel/etc. Particularly if I know what the workload looks like.

      And it breaks down the instant you change the hardware, even in the slightest ways. Frequently the optimizations then made turn around and reduce the speed below naive methods. Modern flash+controllers are massively more complex than the old NOR flash of two decades ago. Which is why they get multiple CPUs managing them.

    • IMO the problem here is that even if your flash drive presents a "dumb flash" API to the OS, there can still be caching and other magic that happens underneath. You could still be in a situation where you write a block to the drive, but the drive only writes that to local RAM cache so that it can give you very fast burst write speeds. Then, if you try to read the same block, it could read that block from its local cache. The OS would assume that the block has been successfully written, but if the power goes out, you're still out of luck.

  • "Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out."

    Clearly the vendors are at odds with the law, selling a storage device that doesn't store.

    I think they are selling snake oil, otherwise known as committing fraud. Maybe they made a mistake in design, and at the very least they should be forced to recall faulty products. If they know about the problem and this behaviour continues, it is basically fraud.

    We allow this to continue, and the manufacturers that actually do fulfill their obligations to the customer suffer financially, while unscrupulous ones laugh all the way to the bank.

    • I agree, all the way up to entire generations of SDRAM being unable to store data at their advertised speeds and refresh timings. (Rowhammer.) This is nothing short of fraud; they backed the refresh off WAY below what's necessary to correctly store and retrieve data accurately regardless of adjacent row access patterns. Because refreshing more often would hurt performance, and they all want to advertise high performance.

      And as a result, we have an entire generation of machines that cannot ever be trusted. And an awful lot of people seem fine with that, or just haven't fully considered what it implies.

    • I don't know if a legal angle is the most helpful, but we probably need a Kyle Kingsbury type to step into this space and shame vendors who make inaccurate claims.

      Which is currently all of them, but that was also the case in the distributed systems space when he first started working on Jepsen.

      2 replies →

    • This isn't fraud.

      The tester is running the device out of spec.

      The manufacturers warrant these devices to behave on a motherboard with proper power hold up times, not in whatever enclosures.

      If the enclosure vendor suggests that behavior on cable pull will fully mimic motherboard ATX power loss, then that is fraud. But they probably have fine print about that, I'd hope.

      5 replies →

  • Nothing says that you can't both offload everything to hardware, and have the application level configure it. Just need to expose the API for things like FLUSH behavior and such...

    • Yeah, you're absolutely right. I'd prefer that the world dramatically change overnight, but if that's not going to happen, some little DIP switch on the drive that says "don't acknowledge writes that aren't durable yet" would be fine ;)

  • > the embedded system model where you get flash hardware that has two operations -- erase block and write block

    > just attach commodity dumb flash chips to our computer

    I kind of agree with your stance; it would be nice for kernel- or user-level software to get low-level access to hardware devices to manage them as they see fit, for the reasons you stated.

    Sadly, the trend has been going toward smart devices for a very long time now. In the very old days, stuff like floppy disk seeks and sector management were done by the CPU, and "low-level formatting" actually meant something. Decades ago, IDE HDDs became common, LBA addressing became the norm, and the main CPU cannot know about disk geometry anymore.

  • I think the main reason they did not expose lower-level semantics is that they wanted a drop-in replacement for HDDs. The second is liability: unfettered access to arbitrary-location erases (and writes) can let you kill (wear out) a flash device in a really short time.

I've actually run into some data loss running simple stuff like pgbench on Hetzner due to this -- I ended up just turning off write-back caching at the device level for all the machines in my cluster:

https://vadosware.io/post/everything-ive-seen-on-optimizing-...

Granted, I was doing something highly questionable (running postgres with fsync off on ZFS). It was very painful to get to the actual issue, but I'm glad I found out.

I've always wondered if it was worth pursuing to start a simple data product with tests like these on various cloud providers to know where these corners are and what you're really getting for the money (or lack thereof).

[EDIT] To save people some time (that post is long), the command to set the feature is the following:

    nvme set-feature -f 6 -v 0 /dev/nvme0n1

The docs for `nvme` (nvme-cli package, if you're Ubuntu based) can be pieced together across some man pages:

https://man.archlinux.org/man/nvme.1

https://man.archlinux.org/man/nvme-set-feature.1.en

It's a bit hard to find all the NVMe features but 6 is the one for controlling write-back caching.

https://unix.stackexchange.com/questions/472211/list-feature...
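
For reference, a minimal sketch of what `nvme set-feature -f 6 -v 0` does under the hood, using the Linux admin passthrough ioctl (device path is an example, error handling trimmed):

    /* Hypothetical sketch: disable the NVMe volatile write cache (Feature ID 0x06)
     * via the Linux admin passthrough ioctl. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme0", O_RDWR);   /* controller device; example path */
        if (fd < 0) { perror("open"); return 1; }

        struct nvme_admin_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x09;   /* Admin: Set Features */
        cmd.cdw10  = 0x06;   /* Feature ID 6: Volatile Write Cache */
        cmd.cdw11  = 0x00;   /* bit 0 (WCE) = 0 -> cache disabled */

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
            perror("NVME_IOCTL_ADMIN_CMD");
            return 1;
        }
        printf("volatile write cache disabled\n");
        return 0;
    }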

  • From reading your vadosware.io notes, I'm intrigued that replacing fdatasync with fsync is supposed to make a difference to durability at the device level. Both functions are supposed to issue a FLUSH to the underlying device, after writing enough metadata that the file contents can be read back later.

    If fsync works and fdatasync does not, that strongly suggests a kernel or filesystem bug in the implementation of fdatasync that should be fixed.

    That said, I looked at the logs you showed, and those "Bad Address" errors are the EFAULT error, which only occurs in buggy software, or some issue with memory-mapping. I don't think you can conclude that NVMe writes are going missing when the pg software is having EFAULTs, even if turning off the NVMe write cache makes those errors go away. It seems likely that that's just changing the timing of whatever is triggering the EFAULTs in pgbench.

    • > From reading your vadosware.io notes, I'm intrigued that replacing fdatasync with fsync is supposed to make a difference to durability at the device level. Both functions are supposed to issue a FLUSH to the underlying device, after writing enough metadata that the file contents can be read back later.

      Yeah I thought the same initially which is why I was super confused --

      > If fsync works and fdatasync does not, that strongly suggests a kernel or filesystem bug in the implementation of fdatasync that should be fixed.

      Gulp.

      > That said, I looked at the logs you showed, and those "Bad Address" errors are the EFAULT error, which only occurs in buggy software, or some issue with memory-mapping. I don't think you can conclude that NVMe writes are going missing when the pg software is having EFAULTs, even if turning off the NVMe write cache makes those errors go away. It seems likely that that's just changing the timing of whatever is triggering the EFAULTs in pgbench.

      It looks like I'm going to have to do some more experimentation on this -- maybe I'll get a fresh machine and try to reproduce this issue again.

      What led me to suspect the NVMe drive was dropping writes was the complete lack of errors on the pg and OS side (dmesg, etc).

I think this is something LTT could handle with their new test lab. They already said they want to set new standards when it comes to hardware testing, so if they can live up to what they promised and hire enough experts, this should be a trivial thing to add to a test course for disk drives.

  • LTT's commentary makes it difficult to trust they are objective (even if they are).

    I loved seeing how giddy Linus got while testing Valve's Steamdeck, but when it comes to actual benchmarks and testing, I would appreciate if they dropped the entertainment aspect.

    • GamersNexus seems to really be trying to work on improving and expanding their testing methodology as much as possible.

      I feel like they've successfully developed enough clout/trust that they have escaped the hell of having to pull punches in order to assure they get review samples.

      They eviscerated AMD for the 6500xt. They called out NZXT repeatedly for a case that was a fire hazard (!). Most recently they've been kicking Newegg in the teeth for trying to scam them over a damaged CPU. They've called out some really overpriced CPU coolers that underperform compared to $25-30 coolers. Etc.

      I bet they'd go for testing this sort of thing, if they haven't already started working on it already. They'd test it and then describe for what use cases it would be unlikely to be a problem vs what cases would be fine. For example, a game-file-only drive where if there's an error you can just verify the game files via the store application. Or a laptop that's not overclocked and only is used by someone to surf the web and maybe check their email.

      2 replies →

    • From the most recent WAN show at 2:22:52[0]:

      > for starters i think the lab is going to focus on written for its own content and then supporting our other content [mainly their unboxing videos]... or we will create a lab channel that we just don't even worry about any kind of upload frequency optimization and we just have way more basic, less opinionated videos, that are just 'here is everything you need to know about it' in video form if, for whatever reason, you prefer to watch a video compared to reading an article

      0: https://youtu.be/rXHSbIS2lLs?t=8572

  • I'd really like to see one of the popular influencers disrupt the review industry by coming up with a way to bring back high quality technical analysis. I'd love to see what the cost of revenue looks like in the review industry. I'm guessing in-depth technical analysis does really bad in the cost of revenue department vs shallow articles with a bunch of ads and affiliate links.

    I think the current industry players have tunnel vision and are too focused on their balance sheets. Things like reputation, trust, and goodwill are crucial to their businesses, but no one is getting a bonus for something that doesn't directly translate into revenue, so those things get ignored. That kind of short sighted thinking has left the whole industry vulnerable to up and coming influencers who have more incentive to care about things like reputation and brand loyalty.

    I've been watching LTT with a fair bit of interest to see if they can come up with a winning formula. The biggest problem is that in-depth technical analysis isn't exciting. I remember reading something many years ago, maybe from JonnyGuru, where the person was explaining how most visitors read the intro and conclusion of an article and barely anyone reads the actual review.

    Basically you need someone with a long term vision who understands the value you get from in-depth technical analysis and doesn't care if the cost of it looks bad on the balance sheet. Just consider it the cost of revenue for creating content and selling merchandise.

    The most interesting thing with LTT is that I think they've got the pieces to make it work. They could put the most relevant / interesting parts of a review on YouTube and skew it towards the entertainment side of things. Videos with in-depth technical analysis could be very formulaic to increase predictability and reduce production costs and could be monetized directly on FloatPlane.

    That way they build their own credibility for their shallow / entertaining videos without boring the core audience, but they can get some cost recovery and monetization from people that are willing to pay for in-depth technical analysis.

    I also think it could make sense as bait to get bought out. If they start cutting into the traditional review industry someone might come along and offer to buy them as a defensive move. I wonder if Linus could resist the temptation of a large buyout offer. I think that would instantly torpedo their brand, but you never know.

    • I use rtings.com every time I buy a monitor.

      https://www.rtings.com/monitor/tools/table

      They rigorously test their hardware and you can filter/sort by literally hundreds of stats.

      I just built a PC and I would have killed for a site that had apples-to-apples benchmarks for SSDs/RAM/etc. Motherboard reviews especially are a huge joke. We're badly missing a site like that for PC components.

      9 replies →

    • I mentioned this in another comment, but I think GamersNexus is doing exactly what you want.

      Regarding influencers: they're being leveraged by companies precisely because they are about "the experience", not actual objective analysis and testing. 99% of the "influencers" or "digital content creators" don't even pretend to try to do analysis or testing, and those that do generally zero in on one specific, usually irrelevant, thing to test.

      1 reply →

    • You wrote: <<bring back high quality technical analysis>>

      How about Tom's Hardware and AnandTech? If they don't count, who does? Years ago, I used to read CDRLabs for optical media drives. Their reviews were very scientific and consistent. (Of course, optical media is all but dead now.)

  • LTT is more focused on entertaining the audience than providing thorough, professional testing.

    • He’s recently pivoted a ton of his business to proper lab testing, and is hiring for it. It’ll be interesting to see, I think he might strike a better balance for those types of videos (I too am a bit tired of the clickbait nature these days).

      4 replies →

    • But audience is also important. If it is only super-technical sources that are reporting faulty drives then the manufacturers won't care much. However, if a very popular source with a large audience reports them, especially in the high-margin "gamer" vertical, then all of a sudden the manufacturers will care a lot.

      So if LTT does start providing more objective benchmarks and reviews it could be a powerful market force.

  • I would personally leave this kind of testing to the pros, like Phoronix, Gamers Nexus, etc. LTT is a facade for useful performance testing and understanding of hardware issues.

I used to develop SSD firmware, and our team always made sure the drive would write the data and check the write status. We also used to analyze competitor products using bus analyzers and could determine that some wouldn't do that. Also, in the past many OS filesystems would ignore many of the errors we returned anyway.

Edit: Here is an old paper on the subject of OS filesystem error handling.

https://research.cs.wisc.edu/wind/Publications/iron-sosp05.p...

The important quote:

> The models that never lost data: Samsung 970 EVO Pro 2TB and WD Red SN700 1TB.

I always buy the EVO Pro’s for external drives and use TB to NVMe bridges and they are pretty good.

  • There is a 970 Evo, a 970 Pro and a 970 Evo Plus, but no 970 Evo Pro as far as I am aware. It would be interesting to know what model OP is actually talking about and whether it is the same for other Samsung NVMe SSDs. I also prefer Samsung SSDs because they are reliable and they usually don't change parts to lower-spec ones while keeping the same model number like some other vendors.

    • And watch out with the 980 Pro, Samsung has just changed the components.

      Samsung have removed the Elpis controller from the 980 PRO and replaced it with an unknown one, and also removed any speed reference from the spec sheet.

      Take a look here for what's changed on the 980 PRO: https://www.guru3d.com/index.php?ct=news&action=file&id=4489...

      It's OK for them to do this, but then they should give the new product a new name, not re-use the old name so that buying it becomes a "silicon lottery" as far as performance goes.

      3 replies →

    • I mostly buy Samsung Pro. Today I put an Evo in a box which I'm sending back for RMA because of damaged LBAs. I guess I'm stopping my tests on getting anything else but the Pros.

      But IIRC Samsung was also called out for switching controllers last year.

      "Yes, Samsung Is Swapping SSD Parts Too | Tom's Hardware"

I'm curious whether the drives are at least maintaining write-after-ack ordering of FLUSHed writes in spite of a power failure. (I.e., whether the contents of the drives after power loss are nonetheless crash consistent.) That still isn't great, as it messes with consistency between systems, but at least a system solely dependent on that drive would not suffer loss of integrity.

Enterprise drives with PLP (power loss protection) are surprisingly affordable. I would absolutely choose them for workstation / home use.

The new Micron 7400 Pro M.2 960GB is $200, for example.

Sure, the published IOPS figures are nothing to write home about, but drives like these 1) hit their numbers every time, in every condition, and 2) can just skip flushes altogether, making them much faster in uses where data integrity is important (and flushes would otherwise be issued).

So, seems those drives may have been ignoring the F_FULLFSYNC after all…

https://news.ycombinator.com/item?id=30371857

The Samsung EVO drives are interesting because they have a few GB of SLC that they use as a secondary buffer before they reflush to the MLC.

  • > reflush to the MLC

    I'm nitpicking, but an EVO has TLC. Also an SLC write cache is the norm for any high performance consumer ssd, it's not just Samsung.

    • > I'm nitpicking, but an EVO has TLC.

      b...but the M in MLC stands for multi... as in multiple... right?

      checks

      Oh... uh; apparently the obvious catch-all term MLC actually only refers to two-bit-per-cell flash, but they didn't call it DLC, and now there's no catch-all term for > SLC. TIL.

      1 reply →

    • Thanks, I thought this was a special Samsung feature. They certainly advertise it as such!

  • The two vendors he tested as not ignoring FLUSHes are precisely the two vendors I was comparing to Apple, so not so fast.

Samsung has a hardware testing lab where all new storage products (SSDs/memory cards) are rigorously put through (automated) tests covering a ridiculous number of reads, writes and power scenarios. The numbers are then averaged out and dialed down a bit to provide some buffer and finally advertised on the models. I'm not surprised that they maintain data integrity. They also own their entire stack (software and hardware) so there is less scope for an untested OEM bug to slip through.

"Data loss occurred with a Korean and US brand, but it will turn into a whole "thing" if I name them so please forgive me."

This does a disservice to those who might be running drives from those vendors with an expectation that they don't lose data post-flush.

That said, this narrows one of the data losers down to Hynix. Curious about the other one, considering how many US-based SSD vendors there are.

  • > That said, this narrows one of the data losers down to Hynix.

    Not really. Samsung builds a plethora of SSDs.

    • Per the title, four vendors were tested. Samsung was already mentioned as a non-loser, so it can't be one of the two losers (or else the title would be wrong and the SSDs would be from 3 vendors at most).

      2 replies →

  • Nobody should be expecting that a flush actually flushes, because the biggest manufacturer of hard drives tells you it doesn't.

    Read the documents and specifications, which this tester didn't do.

    And don't use random enclosures and pull the plug, since the design spec assumes hold-up times and sequencing that the enclosure may not be compliant with.

    • Please stop replying with misinformation all over this thread.

      The NVMe spec is available for free; you should read it.

      And you're 100% wrong about the enclosure too. It's driven by an Intel TB bridge JHL6240 and the drives are PCIe NVMe m.2 devices. Power specs are identical to on-board m.2 slots with PCIe support (which is all modern ones). There is no USB involved.

      See my other reply to you where I explain what Flush actually does (your comments about it are also completely wrong).

      1 reply →

I'm a systems engineer, but I've never done low-level optimizations on drives. How does one even go about testing something like this? It sounds like something cool that I'd like to be able to do.

  • My script repeatedly writes a counter value "lines=$counter" to a file, then calls fcntl() with F_FULLFSYNC against that file descriptor which on macOS ends up doing an NVMe FLUSH to the drive (after sending in-memory buffers and filesystem metadata to the drive).

    Once those calls succeed it increments the counter and tries again.

    As soon as the write() or fcntl() fails it prints the last successfully written counter value, which can be checked against the contents of the file. Remember: the semantics of the API and the NVMe spec mean that a successful return from fcntl(fd, F_FULLFSYNC) on macOS requires that the data is durable at that point, no matter what filesystem metadata OR drive-internal metadata is needed to make that happen.

    In my test while the script is looping doing that as fast as possible I yank the TB cable. The enclosure is bus powered so it is an unceremonious disconnect and power off.

    Two of the tested drives always matched up: whatever the counter was when write()+fcntl() succeeded is what I read back from the file.

    Two of the drives sometimes failed by reporting counter values < the most recent successful value, meaning the write()+ fcntl() reported success but upon remount the data was gone.

    Anytime a drive reported a counter value +1 from what was expected I still counted that as a success... after all, there's a race window where the fcntl() has succeeded but the kernel hasn't gotten the ACK yet. If the disconnect happens at that moment, fcntl() will report failure even though it succeeded. No data is lost, so that's not a "real" error.
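
    For concreteness, the loop described above looks roughly like this in code (a minimal sketch, not the exact script; the path is an example):

        /* Minimal sketch of the durability loop described above (macOS).
         * Not the exact script; the path is illustrative. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("/Volumes/testdrive/counter.txt", O_RDWR | O_CREAT, 0644);
            if (fd < 0) { perror("open"); return 1; }

            for (unsigned long counter = 1; ; counter++) {
                char buf[64];
                int len = snprintf(buf, sizeof(buf), "lines=%lu\n", counter);

                if (pwrite(fd, buf, (size_t)len, 0) != len) break;

                /* F_FULLFSYNC asks the drive itself to flush, not just the kernel.
                 * On success this counter value must survive power loss. */
                if (fcntl(fd, F_FULLFSYNC) < 0) break;

                printf("flushed counter %lu\n", counter);
            }
            /* After remounting, the value in the file must be >= the last counter
             * for which both calls above succeeded (allowing the +1 race above). */
            fprintf(stderr, "stopped; check the file contents after remount\n");
            return 0;
        }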

    • On very recent Linux kernels you can open the raw NVMe device and use the NVMe passthrough ioctl to directly send NVMe commands (or you can use SPDK on essentially any Linux kernel) and bypass whatever the fsync implementation is doing. That gives a much more direct test of the hardware (and some vendors have automated tests that do this with SPDK and IP power switches!). There's a bunch of complexity around the atomicity of operations during power failure, beyond just flush, that has to get verified.

      But the way you tested is almost certainly valid.
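
      A sketch of what that passthrough approach can look like (assumes a recent Linux kernel and root; device path and namespace ID are examples):

          /* Hypothetical sketch: send an NVMe Flush directly with the Linux
           * I/O passthrough ioctl, bypassing the filesystem's fsync path. */
          #include <fcntl.h>
          #include <stdio.h>
          #include <string.h>
          #include <sys/ioctl.h>
          #include <linux/nvme_ioctl.h>

          int main(void)
          {
              int fd = open("/dev/nvme0n1", O_RDWR);
              if (fd < 0) { perror("open"); return 1; }

              struct nvme_passthru_cmd cmd;
              memset(&cmd, 0, sizeof(cmd));
              cmd.opcode = 0x00;   /* NVM command set: Flush */
              cmd.nsid   = 1;      /* namespace 1 (example)  */

              /* Per spec, when this completes every write that had already
               * completed before it was submitted must be on non-volatile media. */
              if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0) {
                  perror("NVME_IOCTL_IO_CMD");
                  return 1;
              }
              printf("flush completed\n");
              return 0;
          }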

    • Is it possible the next write was incomplete when the power cut out? Wouldn't this depend on how updates to file data are managed by the filesystem? The size and alignment of disk and filesystem data & metadata blocks?

      1 reply →

    • This seems like it would only work with an external enclosure setup. I wonder if a test could be performed in the usual NVMe slot.

      Of course, it seems it would be much harder to pull main power for the entire PC. I'm not sure how you'd do that - maybe high speed camera, high refresh monitor to capture the last output counter? Still no guarantee I'm afraid.

      3 replies →

  • Write and flush and unplug the cable!

    • which is more difficult (and sometimes slower) than STONITH style devices which just kill power to the entire machine. The latter allow you to program the whole thing and run test cycle after test cycle where the device kills itself the moment it gets a successful flush.

The problem is you can't trust a model number of SSD. They change controllers, chips, etc after the reviews are out and they can start skimping on components.

https://www.tomshardware.com/news/adata-and-other-ssd-makers...

  • This needs to be cracked down on from a consumer-protection standpoint. Like, any product revision that could potentially produce a different behavior must have a discernable revision number published as part of the model number.

    • >Like, any product revision that could potentially produce a different behavior must have a discernable revision number published as part of the model number.

      AFAIK samsung does this, but it doesn't really help anyone except enthusiasts because the packaging still says "980 PRO" in big bold letters, and the actual model number is something indecipherable like "MZ-V8P1T0B/AM". If this was a law they might even change the model number randomly for CYA/malicious compliance reasons. eg. firmware updated? new model number. DRAM changed, but it's the same spec? new model number. changed the supplier for the SMD capacitors? new model number. PCB etchant changed? new model number.

      7 replies →

    • The PC laptop manufacturers have worked around this for decades by selling so many different short-lived model numbers that you can rarely find information about the specific models for sale at a given moment.

      6 replies →

    • Right.

      And no switching the chipset to a different supplier requiring entirely different drivers between the XYZ1001 and the XYZ1001a, either.

      If I ruled the world I'd do it via trademark law: if you don't follow my set of sensible rules, you don't get your trademarks enforced.

      6 replies →

    • While I agree with the sentiment, even a firmware revision could cause a difference in behavior and it seems unreasonable to change the model number on every firmware release.

      13 replies →

    • It's complicated. Nowadays we have a shortage of electronic components and it's difficult to know what will not be available next month. So it's obvious that manufacturers have to make different variants of a product that can mount different components.

      2 replies →

    • What if it's not a board revision, just a part change?

      What if it wasn't at the manufacturer's discretion; the assembler just (knowingly or unknowingly) had some cheaper knock-off in?

    • Usually the manufacturers are careful not to list official specs that these part swaps affect. All you get is a vague "up to" some B/sec or IOPS.

    • I don't want to live in a world where electronic components can't be commoditized because of fundamentally misinformed regulation.

      There are alternatives to interchangeable parts, and none of them are good for consumers. And that is what you're talking about - the only reason for any part to supplant another in feature or performance or cost is if manufacturers can change them !

  • This practice is false advertising at a minimum, and possibly fraud. I'm shocked there haven't been state AG or CFPB investigations and fines yet.

    Edit: Being mad and making mistakes go hand in hand. FTC is the appropriate organization to go after these guys.

    • >or CFPB investigations and fines yet

      >CFPB

      "The Consumer Financial Protection Bureau (CFPB) is an agency of the United States government responsible for consumer protection in the financial sector. CFPB's jurisdiction includes banks, credit unions, securities firms, payday lenders, mortgage-servicing operations, foreclosure relief services, debt collectors, and other financial companies operating in the United States. "

      2 replies →

    • It's definitely fraud. The only reason to hide the things they do is to mislead the customer as evidenced by previous cases of this that caused serious harm to consumers.

    • What do you expect? These companies are making toys for retail consumers. If you want devices that guarantee data integrity for life or death, or commercial applications, those exist, come with lengthy contracts, and cost 100-1000x more than the consumer grade stuff. Like I seriously have a hard time empathizing with someone who thinks they are entitled to anything other than a basic RMA if their $60 SSD loses data

      2 replies →

  • This is even worse in automotive ECUs. This shortage is only going to make things more difficult to test, and forget about securing them.

Relevant recent discussion about Apple's NVMe being very slow on FLUSH.

Apple's custom NVMes are amazingly fast – if you don't care about data integrity

https://news.ycombinator.com/item?id=30370551

  • Not really a problem when your computer has a large UPS built into it. Desktop macs, not so good.

    But really isn’t the point of a journaling file system to make sure it is consistent at one guaranteed point in time, not necessarily without incidental data loss.

    • > Not really a problem when your computer has a large UPS built into it.

      Actually it is (though a small one). To name some examples where it can still lose data without a full sync:

      - OS crashes

      - a random hard reset, e.g. due to bit flips from cosmic radiation (it happens), or someone putting a magnetic earphone case or similar on your laptop

      Also, any application which cares about data integrity will do full syncs and in turn will get hit by a huge perf penalty.

      I have no idea why people are so adamant about defending Apple in this case; it's pretty clear that they messed up, as performance with a full flush is just WAY too low, and this affects anything which uses full flushes, which any application should at least do on (auto-)save.

      The point of a journaling file system is to make it less likely that the file system _itself_ is corrupted, not that your files are not corrupted if they don't use full sync!

      6 replies →

    • Hard drive write caches are supposed to be battery-backed (i.e., internal to the drive) for exactly this reason. (Apparently the drives tested are not.) Data integrity should not be dependent on the power supply (UPS or not) in any way; it's unnecessary coupling of failure domains (two different domains, no less -- availability vs. integrity).

      7 replies →

    • > Not really a problem when your computer has a large UPS built into it.

      Except that _one time_ you need to work until the battery fails to power the device, at 8%, because the battery's capacity is only 80%. Granted, this is only after a few years of regular use...

      15 replies →

  • And the two vendors I tested as faster than Apple are precisely the two vendors OP found to be reliable, so my findings still stand.

More complex systems are liable to create more complex problems... I don't think you can get away from this - yes, you can solve a problem, but if you model problems as entropy, increasing complexity increases entropy.

It's like the messy room problem - you can clean your (arguably high-entropy) room, but unless you are exceedingly careful, doing so increases overall entropy. You merely move the mess to the garbage bin, expend extra heat, increase your consumption in your diet, possibly break your ankle, strain your muscles.

As a frame of reference, how much loss of FLUSH'd data should be expected on power loss for a semi-permanent storage device (including spinning-platter hard drives, if anyone still installs them in machines these days)?

I'm far more used to the mainframe space where the rule is "Expect no storage reliability; redundancy and checksums or you didn't want that data anyway" and even long-term data is often just stored in RAM (and then periodically cold-storage'd to tape). I've lost sight of what expected practice is for desktop / laptop stuff anymore.

  • The semantics of a FLUSH command (per NVMe spec) is that all previously sent write commands along with any internal metadata must be written to durable storage before returning success.

    Basically the drive is saying "yup, it's all on NAND - not in some internal buffer. You can power off or whatever you want, nothing will be lost".

    Some drives are doing work in response to that FLUSH but still lose data on power loss.

    • A flush command only guarantees, upon completion, that all writes COMPLETED prior to submission of the flush are non-volatile. Not all previously sent writes. NVMe base specification 2.0b section 7.1.

      That's a very important distinction. You can't assume just because a write completed before the flush that it's actually durable. Only if it completed before you sent the flush.

      I'm not very confident that software is actually getting this right all that often, although it probably is in this fsync test.
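
      The same ordering rule shows up at the host I/O level. A sketch using io_uring (liburing) -- the helper is hypothetical and assumes the ring, fd, and buffer are set up elsewhere; the point is that an fsync only has to cover writes that already completed, so either wait for the write's completion or explicitly link the two submissions:

          /* Sketch of the ordering rule at the submission level, using liburing.
           * Hypothetical helper; error handling trimmed. */
          #include <liburing.h>

          void durable_write(struct io_uring *ring, int fd, const void *buf, unsigned len)
          {
              struct io_uring_sqe *sqe;
              struct io_uring_cqe *cqe;

              /* 1) Submit the write... */
              sqe = io_uring_get_sqe(ring);
              io_uring_prep_write(sqe, fd, buf, len, 0);
              io_uring_submit(ring);

              /* 2) ...and WAIT for its completion. A flush/fsync submitted before
               * this point is not required to cover the write. */
              io_uring_wait_cqe(ring, &cqe);
              io_uring_cqe_seen(ring, cqe);

              /* 3) Only now is an fsync guaranteed to cover that write. */
              sqe = io_uring_get_sqe(ring);
              io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
              io_uring_submit(ring);
              io_uring_wait_cqe(ring, &cqe);
              io_uring_cqe_seen(ring, cqe);

              /* Alternative: flag the write SQE with IOSQE_IO_LINK so the kernel
               * holds the fsync back until the write has completed. */
          }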

      2 replies →

  • > how much loss of FLUSH'd data should be expected on power loss for

    0%

    In enterprise you are expected to expect lost data, but only if your drive fails and needs to be replaced, or if it's not yet flushed.

  • None. If the drive responds that the data has been written, it is expected to be there after a power failure.

Most likely not valid.

What was the external enclosure?

You need an ATX 3.3V rail that has a 15ms hold up time and whatever other sequencing these devices were designed for.

The buck converter in a USB enclosure isn't going to cut it for a valid test.

  • Could you explain the question?

    As far as I understand (which is little more than this twitter thread) the flush command should only return a success response once any data has been written to non-volatile storage.

    If the storage still requires power after that point to maintain the data, that storage area is volatile, no?

    So if the device has returned success (and I'm not going to claim that they've ensured that it was the device returning success and not the adapter, or that they even verified what the response was - those seem like valid questions) presumably the power wind-down should not be an issue?

    That said, I presumed by "disconnect the cable" the test involved some extension cable from the motherboard straight to drive to make it easier to disconnect - would that therefore make it a valid test of the NVMe?

    • You understand incorrectly. Flush means data is in volatile DRAM in the device. That's how SSDs work.

      Extension cable from motherboard would certainly make it invalid. These devices are not hot swap and may expect power hold up and sequencing from the supply.

      2 replies →

Laptops, especially the likes of macOS machines with a T2 chip, in which all I/O goes through the T2, can do some clever things. It can essentially turn the underlying NVMe SSD into battery-backed storage. Even if the OS on the main CPU crashes and dies, the T2 chip, with its own independent OS, can ensure the SSD does a full flush before the battery runs out of power. Now, I don't know if Apple does this, but I sure hope they do. It would be great if they published the details so that even Linux on a MacBook can do this well.

I have an update for everyone:

Models that lost writes in my test:

SK Hynix Gold P31 2TB SHGP31-2000GM-2, FW 31060C20

Sabrent Rocket 512 (Phison PH-SBT-RKT-303 controller, no version or date codes listed)

I've ordered more drives and will report back once I have results:

Intel 670p

Samsung 980

WD Black SN750

WD Green SN350

Kingston NV1

Seagate Firecuda 530

Crucial P2

Crucial P5 Plus

These are just my results in my specific test configuration, done by me personally for fun in my own time. I may have made mistakes or the results might be invalid for reasons not yet known. No warranties expressed or implied.

  • Crucial P5 Plus 1TB CT1000P5PSSD8, FW P7CR402: Pass

    Crucial P2 250GB CT250P2SSD8, FW P2CR046: Pass

    Kingston SNVS/250G, 012.A005: Pass

    Seagate Firecuda 530 PCIe Gen 4 1TB ZP1000GM30013, FW SU6SM001: Pass

    Intel 670p 1TB, SSDPEKNU010TZ, FW 002C: Pass

    Samsung 970 Evo Plus: MZ-V7S2T0, 2021.10: Pass

    Samsung 980 250GB MZ-V8V250, 2021/11/07: Pass

    WD Red: WDS100T1R0C-68BDK0, 04Sept2021: Pass

    WD Black SN750 1TB WDS100T1B0E, 09Jan2022: Pass

    WD Green SN350 240GB WDS240G20C, 02Aug2021: Pass

    Flush performance varies by 6x and is not necessarily correlated with overall perf or price. If you are doing lots of database writes or other workloads where durability matters don't just look at the random/sustained read/write performance!

    High flush perf: Crucial P5 Plus (fastest) and WD Red

    Despite being a relatively high end consumer drive the Seagate had really low flush performance. And despite being a budget drive the WD Green was really fast, almost as good as the WD Red in my test.

    The SK Hynix drive had fast flush perf at times, then at other times it would slow down. But it sometimes lost flushed data so it doesn't matter much.

No surprise here. I've never encountered a redundancy feature in storage that worked. Power failure, drive controller failure, connection failure - and the data is kaput, regardless of what was promised.

How can it be so bleak? Can it be that nobody's data redundancy is real? Sure. If you don't test it, regularly, then by the hoary rules of computing it doesn't work.

  • It would be... interesting to run Jepsen on some million-dollar SANs and discover that none of them pass.

    • Don't know what a million-dollar SAN is. Everybody's data is worth a million to them.

      But any consumer-grade redundancy scheme (mirror, raid set, automatic backup) is likely useless.

How do these NVMe SSDs fare when setting the FUA or Force Unit Access bit for write through on Linux (O_DIRECT | O_DSYNC) instead of F_FULLFSYNC on macOS?

I imagine that different firmware machinery would be activated for FUA, and knowing whether FUA works properly would provide comfort to DB developers.
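
For anyone who wants to try it, a minimal sketch of a write-through write on Linux (path is an example; whether the kernel uses FUA or falls back to a cache flush depends on what the drive advertises):

    /* Minimal sketch: O_DIRECT | O_DSYNC write-through write on Linux.
     * Buffer, offset, and length must be aligned for O_DIRECT (4096 here). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t blk = 4096;
        void *buf;
        if (posix_memalign(&buf, blk, blk) != 0) return 1;
        memset(buf, 'x', blk);

        int fd = open("/mnt/test/fua-test.bin",
                      O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* With O_DSYNC each write returns only once the data is stable --
         * via an FUA write or a cache flush, depending on the drive. */
        if (pwrite(fd, buf, blk, 0) != (ssize_t)blk) { perror("pwrite"); return 1; }

        printf("write-through write completed\n");
        close(fd);
        free(buf);
        return 0;
    }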

And that, kids, is how you bullshit your customers into thinking your product has more performance than it really does.

Aren't there filesystem options that affect this? Wasn't there a whole controversy over ext4 and another filesystem not committing changes even after flush (under specific options/scenarios)?

  • Yes, but this is worse. Even if your filesystem does everything perfectly, these SSDs will still lose data.

  • The ext4 "issue" was in userspace. Certain software wasn't calling fsync() and had assumed the data was written anyway, because ext3 was more forgiving.

    IIRC the "solution" was to give ext4 the ext3 semantics, i.e. not insist that every broken userspace program needed to be fixed.

    • This is an old topic, but I disagree that any program that doesn't grind your disk into dust is broken. Consistency without durability is a useful choice.

      1 reply →

He is testing consumer-grade devices which don't have power loss protection by design. That is a "feature" for enterprise devices so they can increase the price for datacenter usage.

https://twitter.com/xenadu02/status/1496006341579751426?s=21

  • This has nothing to do with PLP. If the drive reports PLP then Flush is allowed to be a no-op because all acknowledged writes are durable by design - the OS need only wait for the data write and FS metadata writes to complete without needing to issue a special IO command. This is covered in 5.24.1.4 in the NVMe spec 2.0b

  • He is trusting that drives are conformant to their specs. This is an issue of non-conformance that increases marketable performance at the cost of data security. PLP is great, but in lieu of that the drives should be honest about the state of writes. How can you trust your data will be there after an ACPI shut down?

I'd be much more worried if it weren't for one key piece of data: 'macOS'. I'm not sure I trust the XNU kernel space to be reporting properly.

That being said, I think enough others have said it: buy proper hardware if you don't trust that your family photos are on your corporate cloud account.

Quote: "Then I manually yanked the cable. Boom, data gone."

He never said which cable he yanked. If he pulled the internal one that goes from the PSU to the drive, then that's a very niche test, not relevant. But if he pulled the one from the wall socket to the PC, then yeah, that's a good test.

Regarding this, first of all, Raymond Chen warned us more than a decade ago that vendors lie all the time about their hardware capabilities. He had one test about exactly this kind of flushing, where the HDD driver (written by the manufacturer, of course) always returned S_OK, no matter what. Do note that this was in a time when HDDs were common, not SSDs.

Secondly, I always buy a PSU rated at more than double the system's power draw. If, say, the system needs 500W, the PSU for that system will be at least 1200W. A unit like that has capacitors big enough to keep the system alive for a couple of seconds after the power goes off. Those 2 seconds might seem short to us, but the drives, however much they lie to the OS, will still manage to flush their pending data. I've never experienced data loss going this route.

I'm confident this is a means to cheat performance benchmarks, because some of those will run tests using those flushes to try and get 'real' hardware performance instead of buffered/cached performance.

I wonder if a small battery or capacitor on these devices would work to avoid data loss.

Surprised this is being rediscovered. Years ago only "enterprise" or "data center" SSDs had the supercap necessary to provide power for the controller to finish pending writes. Did we ever expect consumer SSDs to not lose writes on power fail?

  • It's not losing pending writes, it's the drive saying there are no pending writes but losing them anyway - i.e. the drive is most likely lying.

  • This is not about pulling power during writes. Flush is supposed to force all non-committed (i.e. cached) writes to complete. Once that has been acknowledged there is no need for any further writes. So those drives are effectively lying about having completed the flush. I also have to wonder when they intended to write the data...

I'm actually interested in testing this scenario - a drive losing power. Is there something that can cut power to a drive in a server on command? Or do you just pull the drive out of its bay?

  • If your server has IPMI or similar, you can use that to cut power (not to the BMC, though). Otherwise: networked PDUs, many UPSes, consumer-level networked outlets, or something fun with a relay (be careful with mains voltages).

    Pulling the drive is also worth testing; you might get different results. It requires more human involvement, though.

  • I don’t know of a way on most systems to cut the power directly. It seems fairly straightforward to wire up a relay though.

Would this affect filesystems like ZFS? It seems a syscall that returns success to userspace without the data actually being durable would mess all kinds of things up.

  • At best ZFS could detect the failure but a file system can’t save you from drives lying about whether data is actually on the disk.

It's worth noting that PLP (Power Loss Protection) exists on enterprise NVMe drives to mitigate these issues.

I love this guy's dedication, buying a dozen 2TB SSDs just to test them for FLUSH consistency.

> Data loss occurred with a Korean and US brand, but it will turn into a whole "thing" if I name them so please forgive me.

  • > The models that never lost data: Samsung 970 EVO Pro 2TB and WD Red SN700 1TB.

    The others would probably be SK Hynix and Micron/Crucial, right? Curious why he's reluctant to name and shame. A drive not conforming to requirements and losing data is a legitimate problem that should be a "thing"!

    • > Curious why he's reluctant to name and shame

      My sense is he wants to shame review sites for not paying attention to this rather than shame manufacturers directly at this point.

    • Crucial seems plausible, but there's a surprising number of US brands for NVMe SSDs. I was able to find: Crucial, Corsair, Seagate, Plextor, Sandisk, Intel, Kingston, Mushkin, PNY, Patriot Memory, and VisionTek.

      1 reply →

    • Looks like he works at Apple. Maybe what he's testing is work-related or covered by some sort of NDA (e.g. he doesn't want to risk harming supplier relations with the misbehaving brands).

    • I thought Crucial specifically marketed power-loss protection as a differentiating selling point? Well, at least that was the reason I bought one back in the M.2 days (gosh, my PC is ancient...).

      7 replies →

In my opinion, as a consumer, this is up to you. If you need this, get a UPS battery backup (or a laptop which has its own battery). Or, you can get a super specialized SSD. Ultimately though, most consumer SSDs DON’T need this feature. And if they did include it by default, it would likely be environmentally questionable for a feature most people will never use (because most consumer SSDs these days go into laptops with their own batteries).

  • You didn't understand the issue. It's not that these drives lose data with sudden power loss. It's that you tell the drive "please write all data that is currently in your write cache to persistent storage now" and then the drive says "ok I'm all done, your data is safe" and then when you cut power AFTER this, your data is still sometimes gone. This has nothing to do with any batteries, or complicated technology. It just means make your drive not lie.
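
    To make that concrete, the sequence these drives are failing looks roughly like this (a minimal sketch, not the author's actual harness; the path is a placeholder):

    ```c
    /* Write a record, flush it, and only report success once fsync() has
     * acknowledged durability. Power may be cut at any point after that message;
     * if the record is gone on the next boot, the drive lied about the flush. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/mnt/testdisk/record.log",
                      O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return 1;

        const char rec[] = "record 42\n";
        if (write(fd, rec, sizeof rec - 1) < 0) return 1;

        /* fsync() may only return success once the data and the file's metadata
         * are durable -- at the device level this maps to a FLUSH (or FUA). */
        if (fsync(fd) != 0) return 1;

        puts("acknowledged durable");  /* safe to cut power any time after this */
        close(fd);
        return 0;
    }
    ```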

Correct me if I'm wrong, but if these drives are used for consumer applications, this behavior is probably not a big deal? If you made changes to a document, pressed control-S, and then 1 second later the power went out, then you might lose that last save. That'd suck, but you would have lost the data anyways if the power loss occurred 2s before, so it's not that bad. As long as other properties weren't violated (eg. ordering), your data should mostly be okay, aside from that 1s of data. It's a much bigger issue for enterprise applications, eg. a bank's mainframe responsible for processing transactions told a client that the transaction went through, but a power loss occurred and the transaction was lost.

  • Modern SSDs, and especially NVMe drives, have extensive logic for reordering both reads and writes, which is part of why they perform best at high queue depths. So it's not just possible but expected that the drive will be reordering the queue. Also, as batteries age, it becomes quite common to lose power without warning while on a battery.

    In general it's strange to hear excuses for this behavior since it's obviously an attempt to pass off the drive's performance as better than it really is by violating design constraints that are basic building blocks of data integrity.

    • >Modern SSDs, and especially NVMe drives, have extensive logic for reordering both reads and writes, which is part of why they perform best at high queue depths. So it's not just possible but expected that the drive will be reordering the queue.

      If we're already in speculation territory, I'll further speculate that it's not hard to have some sort of WAL mechanism to ensure the writes appear in order. That way you can lie to the software that the writes made it to persistent memory, but still have consistent ordering when there's a crash.

      >Also, as batteries age, it becomes quite common to lose power without warning while on a battery.

      That's... totally consistent with my comment? If you're going for hours without saving and only saving when the OS tells you there's only 3% battery left, then you're already playing fast and loose with your data. Like you said yourself, it's common for old laptops to lose power without warning, so waiting until there's a warning to save is just asking for trouble. Play stupid games, win stupid prizes. Of course, it doesn't excuse their behavior, but I'm just pointing out that, to the typical consumer, the actual impact isn't as bad as people think.

  • It’s a big deal because they are lying. That sets false expectations for the system. There are different commands for ensuring write ordering.

  • > As long as other properties weren't violated (eg. ordering), your data should mostly be okay, aside from that 1s of data.

    That's the thing though—ordering isn't guaranteed as far as I remember. If you want ordering you do syncs/flushes, and if the drive isn't respecting those, then ordering is out of the window. That means FS corruption and such. Not good.

    • The tweet only mentioned data loss when you yanked the power cable. That doesn't say anything about whether the ordering is preserved. It's possible to have a drive that lies about data written to persistent storage, but still keeps the writes in order.

  • > If you made changes to a document, pressed control-S, and then 1 second later the power went out, then you might lose that last save.

    If you made changes to a document, pressed control-S, and then 1 second later the power went out, then the entire filesystem might become corrupted and you lose all data.

    Keep in mind that small writes happen a lot -- a lot a lot. Every time you click a link in a web page it will hit cookies, update your browser history, etc etc, all of which will trigger writes to the filesystem. If one of these writes triggers a modification to the superblock, and during the update a FLUSH is ignored and the superblock is in a temporary invalid state, and the power goes out, you may completely hose your OS.

  • Nope, the problem here is that it violates a very basic ordering guarantee that all kinds of applications build on top of. Consider all the cases of hybrid drives, or just multiple drives, where you fsync on one to journal that you're doing something on the other (e.g. Steam storing the actual games on another drive).

    This behavior will cause all kinds of weird data inconsistencies in super subtle ways.
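
    A minimal sketch of that cross-device pattern, with made-up paths and record format: the intent record on drive A is made durable before the work on drive B starts, so recovery code can trust the journal - unless drive A's flush was a lie.

    ```c
    /* Journal an action on drive B with an intent record on drive A.
     * Paths and the record format are illustrative only. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Append a line to a file and make it durable before returning. */
    static int append_durable(const char *path, const char *line) {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return -1;
        int ok = write(fd, line, strlen(line)) >= 0 && fsync(fd) == 0;
        close(fd);
        return ok ? 0 : -1;
    }

    int install_game(void) {
        /* 1. Record the intent on drive A and flush it. */
        if (append_durable("/driveA/journal.log", "BEGIN install game-123\n") != 0)
            return -1;

        /* 2. Only now do the actual work on drive B (copy files, fsync them). */
        /* ... */

        /* 3. Mark the intent complete, again durably. */
        return append_durable("/driveA/journal.log", "END install game-123\n");
    }
    ```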

  • > As long as other properties weren't violated (eg. ordering)

    That is primarily what fsync is used to ensure. (SCSI provides other means of ensuring ordering, but AFAIK they're not widely implemented.)

    EDIT: per your other reply, yes, it's possible the drives maintain ordering of FLUSHed writes, but not durability. I'm curious to see that tested as well. (Still an integrity issue for any system involving more than just one single drive though.)

  • > That'd suck, but you would have lost the data anyways if the power loss occurred 2s before,

    But if you knew power was failing, which is why you did the ^S in the first place, it would not just suck; it would be worse than that, because your expectations were shattered.

    It's all fine and good to have the computers lie to you about what they're doing, especially if you're in on the gag.

    But when you're not, it makes the already confounding and exasperating computing experience just that much worse.

    Go back to floppies: at least you know the data is saved when the disk stops spinning.

    • >But if you knew power was failing, which is why you did the ^S in the first place, it would not just suck, it be worse than that because your expectations were shattered.

      The only situation I can think of where this is applicable is a laptop running low on battery. Even then, my guess is that there is enough variance in battery chemistry and operating conditions that you're already playing fast and loose with your data if you're only saving when there are a few seconds of battery left. I agree that having it not lose data is objectively better than having it lose data, but that's why I characterized it as "not a big deal".

      2 replies →