
Comment by jrockway

4 years ago

I always liked the embedded system model where you get flash hardware that has two operations -- erase block and write block. GC, RAID, error correction, etc. are then handled at the application level. It was never clear to me that the current tradeoff with consumer-grade SSDs was right. On the one hand, things like error correction, redundancy, and garbage collection don't require attention from the CPU (and, more importantly, don't tie up any bus). On the other hand, the user has no control over what the software on the SSD's chip does. Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.

It would be nice if we could just buy dumb flash and let the application do whatever it wants (I guess that application would usually be your filesystem, but it could also be direct access for specialized use cases like databases). If you want maximum speed, adjust your settings for that. If you want maximum write durability, adjust your settings for that. People are always looking for a one-size-fits-all configuration, but that's hard here. Some people are running cloud providers and already have software to store a block on 3 different continents. Some people are building an embedded system with a fixed disk image that changes once a year, plus some temporary storage for logs. There probably isn't a single setting that gets optimal use out of the flash memory for both use cases. The cloud provider doesn't care if a block, flash chip, drive, server rack, availability zone, or continent goes away. The embedded system may be happy to lose logs in exchange for having enough writes left to install the next security update.

It's all a mess, but the constraints have changed since we made the mess. You used to be happy to get 1/6th of a PCI Express lane for all your storage. Now processors expose 128 PCIe lanes directly and have a multitude of efficiency cores sitting idle. Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.
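
For concreteness, the interface I'm describing is roughly the following (a toy Python sketch, all names and sizes invented; real parts also have page-level programming, bad blocks, ECC requirements, and so on):

    BLOCK_SIZE = 128 * 1024  # erase granularity in bytes (hypothetical)

    class RawFlash:
        """Toy model of 'dumb' flash: just erase_block and write_block."""

        def __init__(self, num_blocks):
            self.blocks = [bytearray(b"\xff" * BLOCK_SIZE) for _ in range(num_blocks)]
            self.erase_counts = [0] * num_blocks   # wear the application must manage itself

        def erase_block(self, i):
            """Reset every bit in block i to 1 (0xFF) -- the only way to free space."""
            self.blocks[i] = bytearray(b"\xff" * BLOCK_SIZE)
            self.erase_counts[i] += 1

        def write_block(self, i, data):
            """Program a previously erased block; overwriting in place is not allowed."""
            assert len(data) == BLOCK_SIZE
            if any(b != 0xFF for b in self.blocks[i]):
                raise IOError("block %d must be erased before it can be rewritten" % i)
            self.blocks[i][:] = data

    # GC, wear leveling, redundancy, and power-loss handling would all live in
    # the filesystem or application built on top of this.
    flash = RawFlash(num_blocks=4)
    flash.write_block(0, bytes(BLOCK_SIZE))
    flash.erase_block(0)   # required before block 0 can be written again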

There are really two problems here:

1. Contemporary mainstream OSes have not risen to the challenge of dealing appropriately with the multi-CPU, multi-address-space nature of modern computers. The proportion of the computer that the "OS" runs on has been shrinking for a long time, and there have only been a few efforts to try to fix that (e.g. HarmonyOS, nrk, RTKit).

2. Hardware vendors, faced with proprietary or non-malleable OSes and incentives to keep as much magic in the firmware as possible, have moved forward by essentially sandboxing the user OS behind a compatibility shim. And because it works well enough, OS developers do not feel the need to adjust to the hardware, continuing the cycle.

There is one notable recent exception in adapting filesystems to SMR/zoned devices. However, this is only on Linux, so desktop PC component vendors do not care. (Quite the opposite: they disable the feature on desktop hardware for market segmentation.)

  • Are there solutions to this in the high-performance computing space, where random access to massive datasets is frequent enough that the “sandboxing” overhead adds up?

    • HPC systems generally use LustreFS, where you have multiple servers handling metadata and objects (files) separately. These servers have multiple tiers of drives: metadata servers are SSD-backed, and file servers run on SSD-accelerated spinning-disk boxes with a mountain of 10TB+ drives.

      When this structure is fed by an EDR/HDR/FDR InfiniBand network, the result is a blazingly fast storage system that can absorb a massive number of random accesses from a very large number of servers simultaneously. The whole structure won't even shiver.

      There are also other tricks Lustre can pull for smaller files to accelerate access and reduce the overhead even further.

      In this model the storage boxes are somewhat sandboxed, but the whole thing is mounted via its own client, so the OS stays very close to the model Lustre provides.

      On the GPU servers, if you're going to provide big NVMe scratch spaces (a la NVIDIA DGX systems), you soft-RAID the internal NVMe disks with mdadm.

      In both models, saturation happens at the hardware level (disks, network, etc.); processors and other software components don't impose a meaningful bottleneck even under high load.

      7 replies →

I can recommend the related talk "It's Time for Operating Systems to Rediscover Hardware". [1]

It explores how modern systems are a set of cooperating devices (each with their own OS) while our main operating systems still pretend to be fully in charge.

[1] https://www.youtube.com/watch?v=36myc8wQhLo

  • Fundamentally the job of the OS is resource sharing and scheduling. All the low-level device management is just a sideshow.

    Hence SSDs use a block-layer abstraction (or in the case of NVMe key/value, hello 1964/CKD) above whatever pile of physical flash, caches, non-volatile caps/batts, etc. That abstraction holds from the lowliest SD card to huge NVMe-oF/FC/etc. smart arrays that are thin provisioning, deduplicating, replicating, snapshotting, etc. You wouldn't want this running on the main cores for performance and power-efficiency reasons. Modern M.2/SATA SSDs have a handful of CPUs managing all this complexity, along with background scrubbing, error correction, etc., so you would be talking fully heterogeneous OS kernels knowledgeable about multiple architectures.

    Basically it would be insanity.

    SSDs took a couple of orders of magnitude off the time values of spinning rust/arrays, but many of the optimization points of spinning rust still apply: pack your buffers and submit large contiguous read/write accesses, queue a lot of commands in parallel, etc.
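
    As a rough host-side illustration of the "pack your buffers" point (hypothetical Python; it only shows the batching, not parallel queue depth or O_DIRECT):

      import os

      BATCH = 1 << 20  # coalesce small records into ~1 MiB contiguous writes

      def append_records(path, records):
          buf = bytearray()
          with open(path, "ab") as f:
              for rec in records:
                  buf += rec
                  if len(buf) >= BATCH:
                      f.write(buf)           # one large, contiguous write
                      buf.clear()
              if buf:
                  f.write(buf)
              f.flush()
              os.fsync(f.fileno())           # one durability barrier for the whole batch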

    So, the fundamental abstraction still holds true.

    And this is true for most other parts of the computer as well. Just talking to a keyboard involves multiple microcontrollers, scheduling the USB bus, packet switching, serializing/deserializing the USB packets, etc. This is also why every modern CPU has a management CPU that bootstraps it and manages its power/voltage/clock/thermals.

    So, hardware abstractions are just as useful as software abstractions like huge process address spaces, file IO, etc.

  • Our modern systems have sort of achieved what microkernels set out to do. Our storage and network each have their own OS.

  • And if the entire purpose of computer programming is to control and/or reduce complexity, I should think the discipline would be embarrassed by the direction the industry has been going the past several years. AWS alone should serve as an example.

    • > And if the entire purpose of computer programming is to control and/or reduce complexity

      I honestly don’t know where you got that idea from. I always thought the whole point of computer programming was to solve problems. If it makes things more complex as a result, then so be it. Just as long as it creates fewer, less severe problems than it solves.

      3 replies →

  • An interesting approach would be to standardize a way to program the controllers in flash disks, maybe something similar to OpenFirmware. Mainframes farm out all sorts of IO to secondary processors, and it was relatively common to overwrite the firmware in Commodore 1541 drives, replacing the basic IO routines with faster ones (or with copy-protection shenanigans). I'm not sure anyone ever did that, but it should be possible to process data stored in files without tying up the host computer.

    • http://la.causeuse.org/hauke/macbsd/symbios_53cXXX_doc/lsilo...

      But it's still an abstraction, and it would have to remain that way unless you're willing to segment it into a bunch of individual product categories, since the functionality of these controllers grows with the target market. The controller on an eMMC isn't anywhere close to the controller on a flash array. So, like GPGPU programming, it's not going to be a single piece of code, because it's going to have to be tuned to each controller for perf as well as power reasons, never mind functionality differences (e.g. it would be hard to do IP/network-based replication if the target doesn't have a network interface).

      There isn't anything particularly wrong with the current HW abstraction points.

      This "cheating" by failing to implement the spec as expected isn't a problem that will be solved by moving the abstraction somewhere else, someone will just buffer write page and then fail to provide non volatile ram after claiming its non volatile/whatever.

      And it's entirely possible to build "open" disk controllers, but it's not as sexy as creating a new processor architecture. Meaning RISC-V has the same problems if you want to talk to industry-standard devices (i.e. the NVMe drive you plug into the RISC-V machine is still running closed-source firmware on a bunch of undocumented hardware).

      But take: https://opencores.org/projects?expanded=Communication%20cont... for example...

      2 replies →

Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology. But in the enterprise SSD space, there's a lot of experimentation with exactly this kind of thing. One of the most popular right now is zoned namespaces, which separates write and erase operations but otherwise still abstracts away most of the esoteric details that will vary between products and chip generations. That makes it a usable model for both flash and SMR hard drives. It doesn't completely preclude dishonest caching, but removes some of the incentive for it.
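
To make the zoned model concrete, the abstraction it exposes looks roughly like this (a minimal Python sketch; real ZNS adds zone states, open-zone limits, zone append, and so on):

    ZONE_SIZE = 256  # logical blocks per zone (hypothetical)

    class Zone:
        """Append-only zone: sequential writes at the write pointer, bulk reset to reclaim."""

        def __init__(self):
            self.blocks = []          # data written so far, strictly in order
            self.write_pointer = 0    # next block index that may be written

        def write(self, block):
            if self.write_pointer >= ZONE_SIZE:
                raise IOError("zone full; it must be reset before reuse")
            self.blocks.append(block)
            self.write_pointer += 1

        def reset(self):
            """The erase-like operation: the whole zone is reclaimed at once."""
            self.blocks.clear()
            self.write_pointer = 0

    # Random overwrites are impossible by construction, so the drive no longer
    # needs a large hidden remapping table and background GC to emulate them;
    # a zone-aware filesystem does its own compaction instead.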

  • There is no strong reason why a consumer SSD can't allow reformatting into a smaller conventional namespace plus a separate zoned namespace. Zone-aware CoW filesystems allow efficiently combining FS-level compaction/space-reclamation with NAND-level rewrites/wear-leveling.

    I'd probably pay for "unlocking" ZNS on my Samsung 980 Pro, if just to reduce the write amplification.

    • Enabling those features on the drive side is little more than changing some #ifdef statements in the firmware, since the same controllers are used in high-end consumer drives and low-power data center drives. But that doesn't begin to address the changes necessary to make those features actually usable to a non-trivial number of customers, such as anyone running Windows.

      3 replies →

  • > Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology.

    From what I understand, the abstraction works a lot like virtual memory. The drive shows up as a virtual address space pretending to be a disk drive, and then the drive's firmware maps virtual addresses to physical ones.

    That doesn't seem at all incompatible with exposing the mappings to the OS through newer APIs so the OS can inspect or change the mappings instead of having the firmware do it.

    • The current standard block storage abstraction presented by SSDs is a logical address space of either 512-byte or 4kB blocks (but pretty much always 4kB behind the scenes). Allocation is implicit upon writing to a block, and deallocation is explicit but optional. This model is indeed a good match for how virtual memory is handled, especially on systems with 4kB pages; there are already NVMe commands analogous to e.g. madvise().

      The problem is that it's not a good match for how flash memory actually works, especially with regard to the extreme size disparity between a NAND page write and a NAND erase block. Giving the OS an interface to query which blocks the SSD considers live/allocated rather than deallocated and implicitly zero doesn't seem all that useful. Giving the OS an interface to manipulate the SSD's logical-to-physical mappings (while retaining the rest of the abstraction's features) would be rather impractical, as both the SSD and the host system would have to care about implementation details like wear leveling.

      Going beyond the current HDD-like abstraction (augmented with optional hints) to an abstraction that is actually more efficient and a better match for the fundamental characteristics of NAND flash requires moving away from a RAM/VM-like model and toward something that imposes extra constraints the host software must obey (e.g. append-only zones). Those constraints are what break compatibility with existing software.
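
      A toy sketch of that mapping layer (invented structure, Python as runnable pseudocode) shows where the page/erase-block disparity bites: every overwrite becomes an append plus deferred copy-out work, and someone has to track validity and wear:

        PAGES_PER_ERASE_BLOCK = 64   # pages that can only be erased together (hypothetical)

        class ToyFTL:
            def __init__(self, num_erase_blocks):
                self.l2p = {}                                   # logical block -> (erase block, page)
                self.valid = [[False] * PAGES_PER_ERASE_BLOCK
                              for _ in range(num_erase_blocks)]
                self.cursor = (0, 0)                            # next free (erase block, page); no bounds/GC wiring, purely illustrative

            def write(self, lba):
                """Host overwrites one logical block; the FTL appends a fresh copy."""
                old = self.l2p.get(lba)
                if old is not None:
                    eb, page = old
                    self.valid[eb][page] = False                # stale copy becomes garbage
                eb, page = self.cursor
                self.valid[eb][page] = True
                self.l2p[lba] = (eb, page)
                self.cursor = (eb, page + 1) if page + 1 < PAGES_PER_ERASE_BLOCK else (eb + 1, 0)

            def pages_to_relocate(self, eb):
                """Before erasing block eb, every still-valid page must be copied out first;
                that copying is write amplification the host never asked for."""
                return [p for p in range(PAGES_PER_ERASE_BLOCK) if self.valid[eb][p]]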

  • If anything, consumer-level SSDs move in the opposite direction. On the Samsung 980 Pro it is not even possible to change the sector size from 512 bytes to 4K.

It's called the program-erase model. Some flash devices do expose raw flash, although it's then usually used by a filesystem (I don't know if any apps use it natively).

There are a _lot_ of problems with managing high-performance NAND yourself. You honestly don't want to do that in your app. If vendors provided full specs and characterization of the NAND and created software-suitable interfaces for the device, then maybe it would be feasible to do in a library or kernel driver, but even then it's pretty thankless work.

You almost certainly want to just buy a reliable device.

Endurance management is very complicated. It's not just a matter of whether any given block's P/E cycles will meet the UBER spec at the data-retention limit with the given ECC scheme. Well, it could be in a naive scheme, but then your costs go up.

Even something as simple as error correction is not simple. Error correction is too slow to do on the host for most IOs, so you need hardware ECC engines on the controller. But those become very large if you give them a huge amount of correction capability, so if errors exceed their capability you might fall back to firmware. Either way, the error rate is still important for knowing the health of the data, so you would need error-rate data to be sent side-band with the data by the controller somehow. If you get a high error rate, does that mean the block is bad, or does it mean you chose the wrong Vt to issue the read with, the retention limit was approached, the page had read-disturb events, dwell time was suboptimal, or the operating temperature was too low? All these things might factor into your garbage collection and endurance management strategy.

Oh, and all of these things vary with every NAND design/process from each NAND manufacturer.

And then there's higher-level redundancy than just per-cell (e.g., word line, chip, block, etc.), all of which depends on the exact geometry of the NAND and how the controller wires it up.

I think a better approach would be a higher-level logical program/free model that sits above the low-level UBER guarantees. GC would have to heed direction coming back from the device about which blocks must be freed and which blocks must be allocated next.

> Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.

I don't know, maybe if there was a "my files exist after the power goes out" column on the website, then I'd sort by that, too?

  • Ultimately the problem is on the review side, probably because there's no money in it. There just aren't enough channels to sell that kind of content into, and it all seems relatively celebrity-driven. That said, I bet there's room for a YouTube personality to produce weekly 10-minute videos where they torture hard drives old and new - and torture them properly, with real scientific/journalistic integrity. So basically you need to be an idealistic, outspoken nerd with a little money to burn on HDDs and an audio/video setup. Such a person would definitely have such a "column" included in their reviews!

    (And they could review memory, too, and do backgrounder videos about standards and commonly available parts.)

  • >I don't know, maybe if there was a "my files exist after the power goes out" column on the website

    More like "don't lose the last 5 seconds of writes if the power goes out". If ordering is preserved, you should keep your filesystem and just lose more writes than you expected.

    • I wouldn't expect ordering of writes to be preserved absent a specific way of expressing that need; part of a write cache's job is reordering writes to be more efficient, which means ordering is not generally preserved.

      But then again, if they're willing to accept and confirm flush commands without flushing, I wouldn't expect them to actually follow ordering constraints.
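
      Concretely, the only portable thing host software can do is insert explicit barriers and hope the device honors them; e.g. a write-ahead pattern (Python sketch, invented file names):

        import os

        def journal_then_apply(journal_path, data_path, record):
            with open(journal_path, "ab") as j:
                j.write(record)
                j.flush()
                os.fsync(j.fileno())    # barrier 1: the journal entry must be durable first
            with open(data_path, "ab") as d:
                d.write(record)
                d.flush()
                os.fsync(d.fileno())    # barrier 2: apply only after the journal survived
            # if the drive acknowledges flushes it never performed, neither barrier means anything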

      1 reply →

The flip side of the tyranny of the hardware flash controller is that the user can't reliably lose data even when they want to. Your super-secure end-to-end messaging system that automatically erases older messages is probably leaving a whole bunch of copies of those "erased" messages lying around on the raw flash on the other side of the hardware flash controller. This can create a weird situation where it is literally impossible to reliably delete anything on certain platforms.

There is sometimes a whole-device erase function provided, but it turns out that a significant portion of tested devices don't actually manage to do that.

  • "Securely erased" has transformed into 1. encrypting all erasable data with a key and 2. "erasing" becomes throwing away the key.

    • But then you have to find a place to store the key that can be securely erased. Perhaps there is some sort of hardware enclave you can misuse. Even a tiny amount of securely erasable flash would be the answer.

      9 replies →

    • Great, we'll just store the key persistently on... Disk? Dammit! OK, how about we encrypt the key with a user auth factor (like a passphrase) and only decrypt the key in memory! Great. Now all we need to do is make sure memory is not persisted to disk for some unrelated reason. Wait...

      4 replies →

> Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.

Yeah, this is how things are supposed to be done, and the fact it's not happening is a huge problem. Hardware makers isolate our operating systems in the exact same way operating systems isolate our processes. The operating system is not really in charge of anything; the hardware just gives it an illusory sandboxed machine to play around in, a machine that doesn't even reflect what the hardware truly looks like. The real computers are all hidden and programmed by proprietary firmware.

https://youtu.be/36myc8wQhLo

  • Flash storage is extremely complex at the low level. The very fact we're talking about microcontroller flash as if it's even in the same ballpark as NVMe SSDs in terms of complexity or storage management says a lot on its own about how much people here understand the subject (including me).

    I haven't done research on flash design in almost a decade, back when I worked on backup software, and my conclusions back then were basically that you're just better off buying a reliable drive that meets your own reliability/performance requirements, making tweaks to your application to match the underlying drive's operational behavior (coalesce writes, append as much as you can, take care with multithreading vs HDDs/SSDs, et cetera), and testing the hell out of that with a blessed software stack. So we also did extensive tests on which host filesystems, kernel versions, etc. seemed "valid" or "good". It wasn't easy.

    The amount of complexity to manage error correction and wear leveling on these devices alone, including auxiliary constraints, probably rivals the entire Linux I/O stack. And it's all vendor-specific in the extreme. An auxiliary case, e.g. the OP's case of handling power loss and flushing correctly, is vastly easier when you only have to consider some controller firmware and some capacitors on the drive, versus the whole OS being guaranteed to handle any given state the drive might be in, with adequate backup power, at the time of failure, for any vendor and any device class. You'll inevitably conclude the drive is the better place to do this job precisely because it eliminates a massive number of variables like this.

    "Oh, but what about error correction and all that? Wouldn't that be better handled by the OS?" I don't know. What do you think "error correction" means for a flash drive? Every PHY on your computer for almost every moderately high-speed interface has a built in error correction layer to account for introduced channel noise, in theory no different than "error correction" on SSDs in the large, but nobody here is like, "damn, I need every number on the USB PHY controller on my mobo so that I can handle the error correction myself in the host software", because that would be insane for most of the same reasons and nearly impossible to handle for every class of device. Many "Errors" are transients that are expected in normal operation, actually, aside from the extra fact you couldn't do ECC on the host CPU for most high speed interfaces. Good luck doing ECC across 8x NVMe drives when that has to go over the bus to the CPU to get anything done...

    You think you want this job. You do not want this job. And we all believe we could handle this job because all the complexity is hidden well enough and oiled by enough blood, sweat, and tears, to meet most reasonable use cases.

  • Apple’s SSDs are like that in some systems, and they’ve gotten flack for it.

    • No, they look like any normal flash drive, actually. Traditionally, for any drive you can buy at the store, the storage controller sits on the literal NVMe drive next to the flash chips, mounted on the PCB, and the controller handles all the "meaty stuff", since that's what the OS talks to. The reason is obvious: you can just plug the drive into an arbitrary computer, and the controller abstracts the differences between vendors, so any NVMe drive works anywhere. The key takeaway is that the storage controller exists "between" the two.

      Apple still has a flash storage controller that exists entirely separately from the host CPU and the host software stack, just like all existing flash drives do today. The difference? The controller doesn't sit on a literal, physical drive next to the flash chips, because there is no separate drive; they just solder the flash directly onto the board instead of using a socketed M.2 module. Again, there's no variability here, so it can all be "hard-coded." The storage controller instead sits near the CPU in the "T2 security chip", which also handles things like in-line encryption on the path from the host to the flash (which is normally handled by host software before being put on the bus). It also does some other stuff.

      So it's not a matter of "architecture", really. The architecture of all T2 Macs which feature this design is very close, at a high level, to any existing flash drive. It's just that Apple is able to put the NVMe controller in a different spot, and their "NVMe controller" actually does more than that; it doesn't have to be located on a separate PCB next to the flash chips at all because it's not a general 3rd party drive. It just has to exist "on the path" between the flash chips and the host CPU.

I would absolutely love to have access to "dumb" flash from my application logic. I've got append-only systems where I could be writing to disk many times faster if the controller weren't trying to be clever in anticipation of block updates.

  • This is like the claim that if I optimize memcpy() for the number of controllers, the levels of cache, and the latency to each controller/cache, it's possible to make it faster than both the CPU-microcoded version (rep movsb/etc.) and the software versions provided by the compiler/glibc/kernel/etc. Particularly if I know what the workload looks like.

    And it breaks down the instant you change the hardware, even in the slightest ways. Frequently the optimizations then turn around and reduce the speed below naive methods. Modern flash+controllers are massively more complex than the old NOR flash of two decades ago, which is why they get multiple CPUs managing them.

  • IMO the problem here is that even if your flash drive presents a "dumb flash" API to the OS, there can still be caching and other magic that happens underneath. You could still be in a situation where you write a block to the drive, but the drive only writes that to local RAM cache so that it can give you very fast burst write speeds. Then, if you try to read the same block, it could read that block from its local cache. The OS would assume that the block has been successfully written, but if the power goes out, you're still out of luck.

"Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out."

Clearly the vendors are at odds with the law, selling a storage device that doesn't store.

I think they are selling snake oil, otherwise known as committing fraud. Maybe they made a mistake in design, and at the very least they should be forced to recall faulty products. If they know about the problem and this behaviour continues, it is basically fraud.

We allow this to continue, and the manufacturers that actually do fulfill their obligations to the customer suffer financially, while unscrupulous ones laugh all the way to the bank.

  • I agree, all the way up to entire generations of SDRAM being unable to store data at their advertised speeds and refresh timings. (Rowhammer.) This is nothing short of fraud; they backed the refresh off WAY below what's necessary to store and retrieve data correctly regardless of adjacent-row access patterns. Because refreshing more often would hurt performance, and they all want to advertise high performance.

    And as a result, we have an entire generation of machines that cannot ever be trusted. And an awful lot of people seem fine with that, or just haven't fully considered what it implies.

  • I don't know if a legal angle is the most helpful, but we probably need a Kyle Kingsbury type to step into this space and shame vendors who make inaccurate claims.

    Which is currently all of them, but that was also the case in the distributed systems space when he first started working on Jepsen.

  • This isn't fraud.

    The tester is running the device out of spec.

    The manufacturers warrant these devices to behave on a motherboard with proper power hold up times, not in arbitrary enclosures.

    If the enclosure vendor suggests that behavior on cable pull will fully mimic motherboard ATX power loss, then that is fraud. But they probably have fine print about that, I'd hope.

    • "The manufacturers warrant these devices to behave on a motherboard with proper power hold up times"

      That's an interesting point: doesn't 'power failure' also include potential failure of the power supply, in which case you might not get that time?

      Or what if a new write command is issued within the hold-up time? Does the motherboard/OS know about the power loss during those 16 milliseconds that the power is still holding?

      4 replies →

Nothing says that you can't both offload everything to hardware and have the application level configure it. You just need to expose an API for things like FLUSH behavior and such...

  • Yeah, you're absolutely right. I'd prefer that the world dramatically change overnight, but if that's not going to happen, some little DIP switch on the drive that says "don't acknowledge writes that aren't durable yet" would be fine ;)

> the embedded system model where you get flash hardware that has two operations -- erase block and write block

> just attach commodity dumb flash chips to our computer

I kind of agree with your stance; it would be nice for kernel- or user-level software to get low-level access to hardware devices to manage them as they see fit, for the reasons you stated.

Sadly, the trend has been going toward smart devices for a very long time now. In the very old days, stuff like floppy disk seeks and sector management was done by the CPU, and "low-level formatting" actually meant something. Decades ago, IDE HDDs became common, LBA addressing became the norm, and the main CPU could no longer know about disk geometry.

I think the main reason they did not expose lower-level semantics is that they wanted a drop-in replacement for HDDs. The second is liability: unfettered access to arbitrary-location erases (and writes) can let you kill (wear out) a flash device in a really short time.

  • I could see that argument for SATA SSDs, but the subject of the thread is NVMe drives.

    • SATA vs NVMe vs SCSI/SAS only matters at the lowest levels of the operating system's storage stack. All the filesystem code and almost all of the block layer can work with any of those transports using the same HDD-like abstractions. Switching to a more flash-friendly abstraction breaks compatibility throughout the storage stack and potentially also with assumptions made by userspace.