Comment by marcan_42
4 years ago
> How are you measuring that
powermetrics gives you DRAM bandwidth per SoC block, before and after the system level caches.
> how do you figure it means the NAND writes are not being held off? Clearly they are by one means or another.
I mean they're not just being held off. It's doing something, not waiting.
> Yes. It is clear the hardware was never optimized for it.
This is a firmware issue. The controller runs on firmware. I can even tell you where to get it and you can throw it in a decompiler and see if you can find the issue, if you're so inclined :-)
> I'm almost certain that is a deliberate choice, and delaying the update is a possible reason for that choice.
Delaying the update does not explain 10MB/s of memory traffic. That means it's doing something, not waiting.
> It's pretty clear the hardware can run this much faster, because it does when it's streaming data out.
Indeed, thus it's highly likely this is a dumb firmware bug, like the FLUSH implementation being really naive and nobody having cared until now because it wasn't a problem on devices where nothing flushes anyway.
> NAND and the controller and FTL just isn't rocket science that you'd have hardware that can sustain the rates that Apple's can and then through some crazy unforeseen problem this would suddenly go slow.
Yup, it's not rocket science, it's humans writing code. And humans write bad code. Apple engineers write bad code too, just take a look at some parts of XNU ;-)
> Flushing data out of your cache into the log is the FTL's bread and butter.
Full flushes are rare on devices where the cache can be considered persistent anyway because there's a battery and the kernel is set up to flush on panics/emergency situations (which it is). Thus nobody ever ran into the performance problem, thus it never got fixed.
> It doesn't suddenly become much more complicated when it's a synchronous flush rather than a capacity flush, it's the same hardware data and control paths, the same data structures in the FTL firmware and would use most of the same code paths even.
The dumbest cache implementation is a big fixed size hash table. That's easy to background flush incrementally on capacity, but then if you want to do a full flush you end up having to do a linear scan even if the cache is mostly empty. And Apple have big SSD caches - on the M1 Max the NVMe carveout is almost 1 gigabyte. Wouldn't surprise me at all if there is some pathological linear scan going on in the case of host flush requests, or some other data structure issue. Or just an outright bug, a cache locality issue, or any other number of things that can kill performance. It's code. Code has bugs and performance issues.
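To make that concrete, here's a toy Python sketch (purely illustrative, not based on Apple's actual firmware) of a fixed-size cache where capacity-driven eviction is cheap but a host FLUSH walks the whole table even when it's nearly empty:

    # Illustrative toy write cache, NOT a reconstruction of Apple's NVMe firmware.
    NUM_BUCKETS = 1 << 20  # pretend the carveout holds ~1M cache slots

    class ToyWriteCache:
        def __init__(self):
            # One slot per bucket: None, or a (lba, data) tuple.
            self.buckets = [None] * NUM_BUCKETS

        def insert(self, lba, data):
            self.buckets[lba % NUM_BUCKETS] = (lba, data)

        def _flush_slot(self, slot):
            if self.buckets[slot] is not None:
                # Write the entry out to NAND here (omitted in this toy).
                self.buckets[slot] = None

        def flush_for_capacity(self, slots_to_free):
            # Background eviction under memory pressure: the table is nearly
            # full, so dirty entries are found almost immediately and the scan
            # stops as soon as enough slots are freed. Cheap.
            freed = 0
            for slot in range(NUM_BUCKETS):
                if self.buckets[slot] is not None:
                    self._flush_slot(slot)
                    freed += 1
                    if freed >= slots_to_free:
                        return

        def flush_all(self):
            # Host FLUSH: this naive version walks every bucket, so the work
            # (and the DRAM traffic) scales with the size of the carveout,
            # not with the handful of dirty entries actually present.
            for slot in range(NUM_BUCKETS):
                self._flush_slot(slot)

A dirty list or occupancy bitmap would make flush_all proportional to the number of dirty entries instead; the point is only that a layout tuned for capacity eviction can fall off a cliff on an explicit full flush.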
Right, so you don't really know what it's doing at all. That it does something different is expected.
> Indeed, thus it's highly likely this is a dumb firmware bug, like the FLUSH implementation being really naive and nobody having cared until now because it wasn't a problem on devices where nothing flushes anyway.
I don't think that's highly likely at all. I think it's highly unlikely.
> Yup, it's not rocket science, it's humans writing code. And humans write bad code. Apple engineers write bad code too, just take a look at some parts of XNU ;-)
I'm not some Apple apologist. I think their fsync() thing is stupid (although I'm very surprised you didn't know about it and that it took you so long to check the man page; it's an old and well-known issue and I don't even use or program for OSX). The hardware is clearly not very good for the task of a non-battery PC (even on batteries I think it's a questionable choice unless they can flush the data in case of an OS crash or low-battery shutdown). I also think their kernel is low-performing and a poor Frankenstein mishmash of useless microkernel bits. So you're not getting me on that one.
> Full flushes are rare on devices where the cache can be considered persistent anyway because there's a battery and the kernel is set up to flush on panics/emergency situations (which it is). Thus nobody ever ran into the performance problem, thus it never got fixed.
I never said the hardware was suitable for this type of operation.
> The dumbest cache implementation is a big fixed size hash table. That's easy to background flush incrementally on capacity, but then if you want to do a full flush you end up having to do a linear scan even if the cache is mostly empty.
I can think of dumber. A linked list you have to search.
This approach is really bad even if you don't have any syncs, because you still want to place LBAs linearly even on NAND, otherwise your read performance on large blocks suffers.
The fact that you can come up with a stupid implementation that might explain it isn't a very good argument IMO. Sure, that might be the case; I didn't say it was impossible, just that I didn't think it was likely. You're saying it's certainly the case. I don't think there's enough evidence for that, at best.
Look, it's just logic. There are a couple of pages in cache. It has to flush them. Finding them and doing that doesn't take 10MB/s of memory traffic and 20ms unless you're doing something stupid. If it were a hardware problem with the underlying storage it wouldn't be eating DRAM bandwidth. The fact that it's doing that means it's doing something with the data in the DRAM carveout (cache) which is much larger/more complicated than what a good data structure would require to find the data to flush. The bandwidth should be 0.3MB/s plus a negligible bit of overhead for the data structure parts, which is the bandwidth of the data being written (and what you get if you do normal writes without flushing at the same rate). Anything above 1MB/s is suspicious, never mind 10MB/s.
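Rough back-of-the-envelope version of that claim, using the numbers above (the 10% metadata allowance is my own loose assumption):

    # Sanity check on the bandwidth argument above. Rates come from the
    # discussion; the metadata overhead factor is a loose assumption.
    write_rate_mb_s = 0.3     # data actually being flushed
    observed_mb_s = 10.0      # DRAM traffic seen in powermetrics
    metadata_overhead = 0.1   # generous allowance for FTL bookkeeping (assumed)

    expected_mb_s = write_rate_mb_s * (1 + metadata_overhead)
    print(f"expected ~{expected_mb_s:.2f} MB/s, observed {observed_mb_s:.0f} MB/s,"
          f" roughly {observed_mb_s / expected_mb_s:.0f}x the flushed data")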
The logic is flawed though. You don't have the evidence or the logic to say it's certainly a bug, or due to stupidity or oversight. I also don't know for certain that it's not, which I'll acknowledge.
And if it were a strange forward-map structure that takes a lot of time to flush but is fast, or small, or easy to implement, that actually supports my statement: that it was a deliberate design choice, not a firmware bug. Gather delay was one example I gave, not an exhaustive list.