Comment by alwillis
3 hours ago
> While that paper praises the Apple advantage in SSD speed, which allows a decent performance for inference with huge models, nowadays SSD speeds equal or greater than that can be achieved in any desktop PC that has dual PCIe 5.0 SSDs, or even one PCIe 5.0 and one PCIe 4.0 SSDs.
Apple’s advantage is its unified memory architecture: the CPU, GPU, and Neural Engine share the same memory, and the SSD controller is integrated into the SoC, giving lower latency than a discrete PCIe link. Memory bandwidth on the higher-end chips is 300+ GB/s.
In an optimized implementation of model inference, SSD access latency is irrelevant, because no random accesses are performed. The goal when optimizing inference for weights stored on SSD is to read from the SSD continuously at the maximum throughput the hardware provides, while ensuring that all computation and all main-memory accesses are overlapped with the SSD reads.
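To make the overlap idea concrete, here is a minimal sketch (not any particular engine's implementation) of the pattern: a prefetch thread reads weight chunks sequentially from disk into a small bounded queue, so the next chunk is being fetched while the current one is consumed by compute. All names (`CHUNK`, `prefetch`, `stream_inference`) are illustrative, and the "compute" step is a stand-in for real layer math.

```python
# Illustrative sketch: overlap sequential SSD reads with compute
# via double buffering. Not a real inference engine.
import os
import queue
import tempfile
import threading

import numpy as np

CHUNK = 1 << 20  # 1 MiB per weight chunk (illustrative size)


def make_dummy_weight_file(n_chunks):
    # Stand-in for a weights file on SSD: n_chunks chunks of float32 ones.
    path = os.path.join(tempfile.mkdtemp(), "weights.bin")
    with open(path, "wb") as f:
        for _ in range(n_chunks):
            f.write(np.ones(CHUNK // 4, dtype=np.float32).tobytes())
    return path


def prefetch(path, q):
    # Reader thread: strictly sequential reads, which is what lets the
    # drive sustain its maximum throughput (no random access).
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                break
            q.put(buf)
    q.put(None)  # sentinel: end of stream


def stream_inference(path):
    # A bounded queue of size 2 gives double buffering: chunk N+1 is read
    # from disk while chunk N is being used for computation.
    q = queue.Queue(maxsize=2)
    t = threading.Thread(target=prefetch, args=(path, q))
    t.start()
    total = 0.0
    while (buf := q.get()) is not None:
        w = np.frombuffer(buf, dtype=np.float32)
        total += float(w.sum())  # stand-in for the layer's matmul
    t.join()
    return total
```

In a real engine the compute step would be the layer's matrix multiplications, and the buffers would typically be pinned and read with direct I/O, but the structure (sequential reads feeding a small bounded buffer that compute drains) is the same.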