
Comment by throwawaylinux

4 years ago

The dirty secret about today's high-density NAND is that tPROG is not fast. It's an order of magnitude slower than in the heyday of SLC. Now, that doesn't really matter for enterprise drives: they complete writes into very fast storage that is made durable one way or another (e.g., flushed on power fail), and this small store gets streamed out to the NAND log asynchronously. This is why random single-queue-depth durable writes can actually be faster than reads on enterprise drives: random reads have to come from NAND (tREAD is still very fast, just not as fast as writing to DRAM).
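
As a rough illustration of that write path (hedged: all names here are hypothetical, this is not any particular vendor's firmware), the trick is to acknowledge the write as soon as it lands in power-loss-protected buffer memory and pay tPROG later, off the latency path:

```c
#include <stdint.h>
#include <string.h>

/* Power-loss-protected (PLP) write buffer: capacitors guarantee it is
 * drained to NAND on power failure, so data here already counts as durable. */
struct plp_buf {
    uint8_t data[16u * 1024 * 1024];   /* size purely illustrative */
    size_t  head;
};

/* Durable write: acknowledged as soon as the copy into PLP DRAM completes.
 * The NAND program (and its tPROG) happens later, in the background drain. */
int handle_durable_write(struct plp_buf *b, const void *src, size_t len)
{
    if (b->head + len > sizeof(b->data))
        return -1;              /* buffer full: wait for the drain to catch up */
    memcpy(&b->data[b->head], src, len);
    b->head += len;
    return 0;                   /* acked at DRAM speed */
}
```

A random read that misses this buffer has to pay tREAD on the NAND itself, which is exactly why the write above can complete faster than a read.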

Apple may not implement such a durable cache; that's fine, it's not an enterprise device and it's a cost tradeoff. So they might have to flush to NAND on any FUA, and that's slow, as we've said, but not 25ms slow. Modern QLC NAND tPROG latency is more like 2.5ms-5ms, which could just about explain the EVO results once you include the OS, the SATA stack, and the drive controller.

There's pretty close to a 0% chance Apple would have messed this up accidentally, though, in my opinion. It would have been a deliberate design choice for some reason. One possible reason that comes to mind is that some drives gang a bunch of chips in parallel, and you end up with pretty big "logical" pages. Flushing a big logical page on a 4kB write is going to cause a lot of write amp and drive wear, so you might delay for a short period (20ms) to try to pick up other writes and reduce your inefficiency.
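
To put illustrative (entirely made-up) numbers on that: gang eight dies with 16kB pages and your logical page is 128kB, so flushing it for a single 4kB write is 128/4 = 32x write amplification; holding off ~20ms in the hope of batching a few dozen more 4kB writes into the same logical page pulls that back toward 1x.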

Nope, it's not a deliberate optimization / delay. Doing the flushes creates an extra ~10MB/s of DRAM memory traffic from the NVMe controller vs. not doing them, while sustaining the same write rate. The firmware is doing something dumb when issued a flush command; it's not just sitting around and waiting.

> There's pretty close to a 0% chance Apple would have messed this up accidentally, though, in my opinion

There's pretty close to 100% chance Apple would not have cared/optimized for this when designing this SSD controller, because it was designed for iOS devices which always have a battery, and where next to no software would be issuing flushes.

And then they put this hardware into desktops. Oops :-)

Lots of things about the M1 were rushed and have been fixed along the way. I wouldn't be in the least bit surprised if this were one more of them that gets fixed a couple macOS versions down the line, now that I've made some noise about it.

  • > Nope, it's not a deliberate optimization / delay. Doing the flushes creates an extra ~10MB/s of DRAM memory traffic from the NVMe controller vs. not doing them, while sustaining the same write rate.

    How are you measuring that and how do you figure it means the NAND writes are not being held off? Clearly they are by one means or another.

    > The firmware is doing something dumb when issued a flush command; it's not just sitting around and waiting.

    > There's pretty close to 100% chance Apple would not have cared/optimized for this when designing this SSD controller, because it was designed for iOS devices which always have a battery, and where next to no software would be issuing flushes.

    Yes. It is clear the hardware was never optimized for it. Because it is so slow. I'm almost certain that is a deliberate choice, and delaying the update is a possible reason for that choice. It's pretty clear the hardware can run this much faster, because it does when it's streaming data out.

    NAND, the controller, and the FTL just aren't rocket science: you don't build hardware that can sustain the rates Apple's can and then, through some crazy unforeseen problem, have it suddenly go slow. Flushing data out of your cache into the log is the FTL's bread and butter. It doesn't suddenly become much more complicated when it's a synchronous flush rather than a capacity flush: it's the same hardware data and control paths and the same data structures in the FTL firmware, and it would mostly even use the same code paths.

    Pull blocks from the buffer in order and build pages, allocate pages in NAND to send them to, update the forward map, repeat.
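
    In sketch form (hedged: every name below is hypothetical; this is just the loop as described, not Apple's actual firmware):

    ```c
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t lba_t;    /* logical block address, 4kB units  */
    typedef uint32_t paddr_t;  /* physical address (page or 4kB slot, as noted) */

    #define BLOCKS_PER_PAGE 4  /* e.g. 16kB NAND page / 4kB blocks  */

    struct wbuf { lba_t lba[1024]; int head, tail; };  /* simplified FIFO */

    static bool  buf_empty(const struct wbuf *b) { return b->head == b->tail; }
    static lba_t buf_pop(struct wbuf *b)         { return b->lba[b->head++]; }

    struct ftl {
        struct wbuf wbuf;
        paddr_t     log_head;          /* next free page in the NAND log   */
        paddr_t     fwd_map[1u << 20]; /* LBA -> physical 4kB slot         */
    };

    static void nand_program(paddr_t page) { (void)page; } /* pays tPROG */

    /* Pull blocks in order, build a page, program it, update the map, repeat. */
    void flush_cache_to_log(struct ftl *f)
    {
        while (!buf_empty(&f->wbuf)) {
            lba_t batch[BLOCKS_PER_PAGE];
            int n = 0;
            while (n < BLOCKS_PER_PAGE && !buf_empty(&f->wbuf))
                batch[n++] = buf_pop(&f->wbuf);

            paddr_t dst = f->log_head++;       /* allocate at the log head */
            nand_program(dst);

            for (int i = 0; i < n; i++)        /* reads now find the new copy */
                f->fwd_map[batch[i]] = dst * BLOCKS_PER_PAGE + i;
        }
    }
    ```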

    • > How are you measuring that

      powermetrics gives you DRAM bandwidth per SoC block, before and after the system-level caches.

      > how do you figure it means the NAND writes are not being held off? Clearly they are by one means or another.

      I mean they're not just being held off. It's doing something, not waiting.

      > Yes. It is clear the hardware was never optimized for it.

      This is a firmware issue. The controller runs on firmware. I can even tell you where to get it and you can throw it in a decompiler and see if you can find the issue, if you're so inclined :-)

      > I'm almost certain that is a deliberate choice, and delaying the update is a possible reason for that choice.

      Delaying the update does not explain 10MB/s of memory traffic. That means it's doing something, not waiting.

      > It's pretty clear the hardware can run this much faster, because it does when it's streaming data out.

      Indeed, thus it's highly likely this is a dumb firmware bug, like the FLUSH implementation being really naive and nobody having cared until now because it wasn't a problem on devices where nothing flushes anyway.

      > NAND, the controller, and the FTL just aren't rocket science: you don't build hardware that can sustain the rates Apple's can and then, through some crazy unforeseen problem, have it suddenly go slow.

      Yup, it's not rocket science, it's humans writing code. And humans write bad code. Apple engineers write bad code too, just take a look at some parts of XNU ;-)

      > Flushing data out of your cache into the log is the FTL's bread and butter.

      Full flushes are rare on devices where the cache can be considered persistent anyway, because there's a battery and the kernel is set up to flush in panic/emergency situations (which it is). Thus nobody ever ran into the performance problem, and thus it never got fixed.

      > It doesn't suddenly become much more complicated when it's a synchronous flush rather than a capacity flush: it's the same hardware data and control paths and the same data structures in the FTL firmware, and it would mostly even use the same code paths.

      The dumbest cache implementation is a big fixed-size hash table. That's easy to background flush incrementally on capacity, but if you want to do a full flush you end up having to do a linear scan even if the cache is mostly empty. And Apple has big SSD caches: on the M1 Max, the NVMe carveout is almost 1 gigabyte. Wouldn't surprise me at all if there is some pathological linear scan going on in the case of host flush requests, or some other data structure issue. Or just an outright bug, a cache locality issue, or any number of other things that can kill performance. It's code. Code has bugs and performance issues.
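
      To make that failure mode concrete (purely illustrative, not the actual firmware):

      ```c
      #include <stdbool.h>
      #include <stddef.h>

      #define NSLOTS (1u << 18)  /* say, 256k slots of 4kB: ~1GB carveout */

      struct slot { bool dirty; /* ...lba, data pointer... */ };
      static struct slot cache[NSLOTS];

      static void write_back(struct slot *s) { s->dirty = false; }

      /* Capacity eviction touches only the slot it hashes to: cheap. */
      void evict_one(size_t i)
      {
          if (cache[i].dirty)
              write_back(&cache[i]);
      }

      /* Host FLUSH walks every slot, so it costs O(table size) even
       * when only a handful of entries are dirty. */
      void host_flush(void)
      {
          for (size_t i = 0; i < NSLOTS; i++)
              if (cache[i].dirty)
                  write_back(&cache[i]);
      }
      ```

      Keeping a dirty list (or even just a count) would make the host flush O(dirty entries) instead; that's the size of design decision nobody notices until someone actually benchmarks flushes.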


I don't know what tPROG is (or anything else). Is there a "What every programmer should know about storage" à la Drepper's work on memory?

  • tPROG is the time it takes to program a NAND page, from when you put the "program page" command on the pins to when you read off a successful status.
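
    Concretely, for an ONFI-style part, the window being timed looks roughly like this (the bus helpers are hypothetical stubs; the opcodes are the standard ONFI ones):

    ```c
    #include <stdint.h>
    #include <stddef.h>

    static void    nand_cmd(uint8_t op)                   { (void)op;  }
    static void    nand_addr(uint64_t row)                { (void)row; }
    static void    nand_data_out(const void *p, size_t n) { (void)p; (void)n; }
    static uint8_t nand_read_status(void)                 { return 0x40; } /* stub: RDY */

    int program_page(uint64_t row, const void *buf, size_t len)
    {
        nand_cmd(0x80);               /* PAGE PROGRAM setup              */
        nand_addr(row);
        nand_data_out(buf, len);
        nand_cmd(0x10);               /* confirm: tPROG starts here      */

        nand_cmd(0x70);               /* READ STATUS                     */
        while (!(nand_read_status() & 0x40))
            ;                         /* SR[6]=RDY goes high: tPROG ends */

        return (nand_read_status() & 0x01) ? -1 : 0;  /* SR[0]: program FAIL */
    }
    ```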

    Some of the basic NAND guides Micron puts out are simple enough to understand the basics of operation:

    https://www.micron.com/-/media/client/global/documents/produ...

    The details get very complicated and proprietary. NAND wears out as you use it, but it also has a retention time: it gradually loses charge and won't read back if you leave it unpowered for long enough. This is actually where enterprise drives can be specced worse than consumer ones. So durability/lifetime is specified as meeting given uncorrected error rates at a given retention period. The physics of NAND is pretty interesting too, as is how it translates into how a controller optimizes these parameters. Temperature at various stages of operation and retention changes the cell properties; the time between erase and program does too. You can adjust voltages on read, program, and erase, and those can help you read data out or change the profile of the data. Reading can disturb parts of other pages (similar to rowhammer). Multi-level cells are interesting as well: some of them you program in multiple passes, so that's a whole other spanner in the works.

    I don't know of a good place that covers all that, but much beyond "read/program/erase + wear + retention" is probably beyond "what every programmer should know".

    The way you turn a bunch of NAND chips that have a "read/program/erase" programming model into something that has a read/write model (the flash translation layer, or FTL) is a whole other thing again. And all the endurance management and optimization, error correction... pretty fascinating details, really. The basic idea is that they use the same concepts as a log-structured filesystem; it turns out a log structure with garbage collection is about a perfect fit for turning the program/erase model into a random-write model. That's probably what every programmer should know about that (assuming you know something about LSFs: garbage collection, write amplification, forward and reverse mapping schemes, etc.).
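
    As a hedged sketch of where garbage collection, write amplification, and the forward/reverse maps meet (all names made up):

    ```c
    #include <stdint.h>

    #define PAGES_PER_EB 64u          /* pages per erase block (illustrative) */
    #define NUM_PAGES    (1u << 20)
    #define NUM_LBAS     (1u << 20)
    #define LBA_NONE     0xffffffffu

    typedef uint32_t paddr_t;

    static uint32_t rev_map[NUM_PAGES]; /* physical page -> LBA it holds */
    static paddr_t  fwd_map[NUM_LBAS];  /* LBA -> current physical page  */
    static paddr_t  log_head;

    static paddr_t log_alloc_page(void)                 { return log_head++; }
    static void    nand_copy_page(paddr_t s, paddr_t d) { (void)s; (void)d; }
    static void    nand_erase_block(uint32_t eb)        { (void)eb; }

    /* Reclaim one erase block: relocate still-live pages to the log head
     * (that relocation is the write amplification), then erase the block. */
    void gc_erase_block(uint32_t eb)
    {
        for (uint32_t i = 0; i < PAGES_PER_EB; i++) {
            paddr_t  src = eb * PAGES_PER_EB + i;
            uint32_t lba = rev_map[src];
            if (lba == LBA_NONE || fwd_map[lba] != src)
                continue;                   /* superseded: garbage, skip */
            paddr_t dst = log_alloc_page();
            nand_copy_page(src, dst);
            fwd_map[lba] = dst;             /* keep both maps consistent */
            rev_map[dst] = lba;
        }
        nand_erase_block(eb);
    }
    ```

    Every live page copied in that loop is a write the host never asked for: write amplification in its purest form.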

    • "What every programmer should know" in this context is shorthand for how Drepper views that set of things to know, i.e., yes it's hard, and yes you really should know it; you're a professional programmer. Storage is a little bit further away than memory, but it's still very important in certain lines of work.

> Apple may not implement such a durable cache, that's fine it's not an enterprise device and it's a cost tradeoff.

I disagree with this: my Apple machine is an enterprise device. It's a MacBook Pro, issued by my employer, to do real work. I wouldn't give Apple a pass on this dimension. I get that the "Pro" label doesn't mean what it used to, but these aren't toys either.