Comment by bob1029
4 years ago
I would absolutely love to have access to "dumb" flash from my application logic. I've got append-only systems where I could be writing to disk many times faster if the controller weren't trying to be clever in anticipation of block updates.
The ECC and anything to do with multi- or triple-level cell flash is quite non-trivial. You don’t want to have to think about these things if you don’t have to. But yes, better control over the flash controllers would be nice. There are alternative modes for NVMe like those specifically for key-value stores: https://nvmexpress.org/developers/nvme-command-set-specifica...
This is like the claim that if I optimize memcpy() for the number of controllers, levels of cache, and latency to each controller/cache, it's possible to make it faster than both the CPU microcoded version (rep movsq/etc) and the software versions provided by the compiler/glibc/kernel/etc. Particularly if I know what the workload looks like.
And it breaks down the instant you change the hardware, even in the slightest way. Frequently the optimizations you made end up reducing the speed below that of the naive methods. Modern flash+controllers are massively more complex than the old NOR flash of two decades ago, which is why they get multiple CPUs managing them.
IMO the problem here is that even if your flash drive presents a "dumb flash" API to the OS, there can still be caching and other magic that happens underneath. You could still be in a situation where you write a block to the drive, but the drive only writes that to local RAM cache so that it can give you very fast burst write speeds. Then, if you try to read the same block, it could read that block from its local cache. The OS would assume that the block has been successfully written, but if the power goes out, you're still out of luck.
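To make the gap between "written" and "durable" concrete, here is a minimal sketch of an append that also asks the OS to flush. The helper name durable_append is made up for illustration. os.fsync() only requests that the kernel push the data to the device; whether the drive then empties its own RAM cache is up to the firmware, and on some platforms fsync does not ask the drive to flush at all, so treat this as an assumption-laden sketch rather than a guarantee.

```python
import os
import tempfile


def durable_append(path: str, payload: bytes) -> int:
    """Append payload, then ask the OS to push it to stable storage.

    Hypothetical helper: returns the file size after the append.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the data may sit only in the OS page cache.
        # Even with it, the drive's own volatile cache is flushed only
        # if the platform and the firmware honor the request.
        os.fsync(fd)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "append.log")
        durable_append(log, b"record-1\n")
        print(durable_append(log, b"record-2\n"))
```

The point is that a successful write() by itself says nothing about durability. The explicit flush is the only place the application gets to ask, and even then it is trusting the layers underneath.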
Have you had a look at ZoneFS? It exposes pretty much exactly that model to userspace: https://www.kernel.org/doc/html/latest/filesystems/zonefs.ht...
It does need support from the storage device though.
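For anyone who hasn't seen the zoned model, here is a toy in-memory sketch of the rules as I understand them from the zonefs documentation: a sequential zone only accepts writes at its current write pointer, and the only way to reclaim space is to reset the whole zone. The Zone class and its method names are hypothetical. This is not the zonefs API, just an illustration of the constraint.

```python
class Zone:
    """Toy model of one sequential-write zone (hypothetical, not zonefs itself)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.data = bytearray()

    @property
    def write_pointer(self) -> int:
        # The next writable offset is simply how much has been written so far.
        return len(self.data)

    def write(self, offset: int, payload: bytes) -> int:
        """Append payload at offset; return the new write pointer."""
        if offset != self.write_pointer:
            raise ValueError("writes must land exactly at the write pointer")
        if self.write_pointer + len(payload) > self.capacity:
            raise ValueError("zone is full")
        self.data += payload
        return self.write_pointer

    def reset(self) -> None:
        # No in-place overwrite: the whole zone is discarded at once.
        self.data.clear()


if __name__ == "__main__":
    zone = Zone(capacity=8)
    zone.write(0, b"abcd")
    print(zone.write(4, b"efgh"))
    zone.reset()
    print(zone.write_pointer)
```

An append-only log maps onto this naturally, since it never needs to overwrite in place. That is why the model is attractive for the workload described at the top of the thread.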