Comment by mananaysiempre
7 hours ago
115 MiB/s is something like 20 to 30 cycles per byte on a laptop, 50 on a desktop. That’s definitely quite slow as far as a CPU’s capacity to ingest bytes, but unfortunately about as fast as it gets for scalar (machine) code that does meaningful work per byte. There may be another factor of 2 or 3 to be had somewhere, or there may not be. If you want to go meaningfully faster, as in at least at the speed of your disk[1], you need to stop doing work per byte and start vectorizing. For parsers, that is possible but hard.
A quick rule of thumb is that one or two bytes per peak clock cycle per core or so (not unlike an old 8 bit or 16 bit machine!) is the worst case for memory bandwidth when running highly multithreaded workloads that heavily access main RAM outside cache. So there's a lot of gain to be had before memory bandwidth is truly saturated, and even then one can plausibly move to GPU-based compute and speed things up further. (Unified memory+HBM may potentially add a 2x or 3x multiplier to this basic figure, but either way it's in the ballpark.)