
Comment by ribit

2 years ago

Which hardware were these results obtained on? Are you talking about laptops or large multi-core workstations? I am not at all surprised that linking needs a lot of memory bandwidth (after all, it's mostly copying), but we are talking about fairly small CPUs (6–8 cores) by modern standards. To fully saturate the M3 Pro's 150GB/s on a multicore workload you'd need to transfer ~8 bytes per cycle per core between L2 and RAM on average, which is a lot for a compile workload. Maybe you can hit it during data transfer spikes, but frankly, I'd be shocked if it turned out that 150GB/s is the compilation bottleneck.
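As a quick back-of-the-envelope check of that figure (a sketch; the ~6 busy cores and the ~3.5GHz sustained clock are assumptions, not measurements):

```rust
fn main() {
    let bandwidth = 150.0e9; // M3 Pro memory bandwidth, bytes/sec
    let cores = 6.0;         // assumed: ~6 cores busy compiling
    let clock_hz = 3.5e9;    // assumed: ~3.5 GHz sustained clock

    // Average bytes each core must move per cycle to saturate the bus.
    let bytes_per_cycle_per_core = bandwidth / (cores * clock_hz);
    println!("{bytes_per_cycle_per_core:.1} bytes/cycle/core"); // ≈ 7.1
}
```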

Regarding mold… maybe it can indeed saturate 150GB/s on a 6-core laptop. But they were not using mold. Also, the timing differences we observe here are larger than what a 25% bandwidth reduction would produce with a linker like mold: mold can link clang in under 3 seconds, so reduced bandwidth would add a second at most, while the variation we see between the M2 and M3 results is much larger.

I was directly addressing the «saturate» part of the statement, not memory becoming the bottleneck. Since builds are inherently parallel nowadays, saturating the memory bandwidth is very easy: in the 1:1 core-to-process mapping scenario each CPU core runs a scheduled compiler process, and all the cores suddenly start competing for memory access. This is true for all architectures and designs where memory is shared. The same reasoning does not apply to NUMA architectures, but those are nearly entirely non-existent apart from certain fringe cases.
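To make the 1:1 mapping concrete, here is a minimal sketch of what a parallel build driver does (the `cc` invocation and the file names are placeholders, not a real build):

```rust
use std::process::Command;
use std::thread;

fn main() {
    // Placeholder translation units; a real build system discovers these.
    let jobs = ["a.c", "b.c", "c.c", "d.c", "e.c", "f.c"];
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    // One compiler process per core, like `make -jN`: every process
    // streams sources, ASTs and object code over the same shared bus.
    for batch in jobs.chunks(cores) {
        let children: Vec<_> = batch
            .iter()
            .map(|src| Command::new("cc").args(["-c", src]).spawn())
            .collect();
        for child in children {
            let _ = child.and_then(|mut c| c.wait());
        }
    }
}
```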

Linking, in fact, whilst benefitting from faster/wider memory, is less likely to saturate it unless the linker is heavily and efficiently multithreaded. For instance, GNU ld is single-threaded; gold is multithreaded, but Ian Taylor has reported very small performance gains from the use of multithreading in gold; mold takes full advantage of concurrent processing; and LLVM's lld is somewhere in between.
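The difference the threading model makes can be seen with a crude streaming-copy microbenchmark (a sketch; the 1GiB buffers are assumed to be large enough to defeat the caches, and the reported figure is the copy rate, so actual bus traffic is roughly double):

```rust
use std::thread;
use std::time::Instant;

// Copy `src` into `dst` with `threads` workers and return GB/s.
fn copy_bw(src: &[u8], dst: &mut [u8], threads: usize) -> f64 {
    let chunk = src.len() / threads;
    let start = Instant::now();
    thread::scope(|s| {
        for (d, sr) in dst.chunks_mut(chunk).zip(src.chunks(chunk)) {
            s.spawn(move || d.copy_from_slice(sr));
        }
    });
    src.len() as f64 / start.elapsed().as_secs_f64() / 1e9
}

fn main() {
    let n = 1 << 30; // 1 GiB
    let src = vec![1u8; n];
    let mut dst = vec![0u8; n];
    let cores = thread::available_parallelism().map(|c| c.get()).unwrap_or(1);
    // A single thread (GNU ld's model) rarely saturates the bus;
    // splitting the copy across all cores (mold's model) gets much closer.
    println!("1 thread : {:.1} GB/s", copy_bw(&src, &mut dst, 1));
    println!("{cores} threads: {:.1} GB/s", copy_bw(&src, &mut dst, cores));
}
```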

In the M1/M2/M3 Max/Ultra, the math is a bit different. Each performance core is practically capped at ~100GB/s of memory transfer. The cores are organised into core clusters of n P and m E cores, and each core cluster is capped at ~240GB/s. The aggregate memory bandwidth is ~400GB/s for the entire SoC (~800GB/s for the Ultra setup), but that is also shared with the GPU, the ANE and the other compute acceleration cores/engines. Since each core cluster has multiple cores, a large parallel compilation can saturate the memory bandwidth easily.
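A toy model of those caps (the ~100/240/400GB/s figures are the ones quoted above; the 6-core cluster size is an assumption, as the real P/E cluster layout varies by chip):

```rust
// Effective bandwidth available to `n` busy cores under the quoted caps.
fn effective_bw(n: u32) -> f64 {
    const PER_CORE: f64 = 100.0;    // ~100 GB/s per core
    const PER_CLUSTER: f64 = 240.0; // ~240 GB/s per core cluster
    const SOC: f64 = 400.0;         // ~400 GB/s whole SoC (shared with GPU/ANE)
    const CLUSTER_SIZE: u32 = 6;    // assumed cluster size

    let clusters = ((n + CLUSTER_SIZE - 1) / CLUSTER_SIZE) as f64;
    (f64::from(n) * PER_CORE).min(clusters * PER_CLUSTER).min(SOC)
}

fn main() {
    for n in [1, 2, 3, 6, 12] {
        println!("{n:>2} cores -> {:>3.0} GB/s", effective_bw(n));
    }
    // 1 -> 100, 2 -> 200, 3 -> 240: three busy cores already hit a cluster's
    // cap, and two full clusters (12 cores) hit the 400 GB/s SoC ceiling.
}
```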

Code optimisation and type inference in strongly, statically typed languages with polymorphic types (Haskell, Rust, ML and others) are very memory intensive, especially at scale. There are many kinds of optimisation, and most of them reduce to constraint solving or are NP-complete; in particular, inlining coupled with inter-procedural optimisation requires very large amounts of memory on large codebases, and other optimisation techniques are memory bound as well. Type inference for polymorphic types in the Hindley–Milner type system is also memory intensive, because the inference engine must maintain a large depth of unification state (== memory) to deduce a type successfully. So it is not entirely unfathomable that «~8 bytes per cycle/core between L2 and the RAM on average» is rather modest for a highly optimising modern compiler.
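For a feel of where the memory goes in inference, here is a toy Hindley–Milner-style unifier (a minimal sketch omitting let-generalisation and the occurs check): the substitution map and the type terms it points at are precisely the state that grows with the depth of the deduction.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Ty {
    Var(u32),                // type variable, e.g. 'a
    Int,                     // a base type
    Arrow(Box<Ty>, Box<Ty>), // function type t1 -> t2
}

type Subst = HashMap<u32, Ty>;

// Chase the substitution chain until a non-variable or a free variable.
fn resolve(t: &Ty, s: &Subst) -> Ty {
    match t {
        Ty::Var(v) => s.get(v).map_or_else(|| t.clone(), |u| resolve(u, s)),
        _ => t.clone(),
    }
}

// Unify two types, extending the substitution. The map and the chains
// inside it are the "depth == memory" mentioned above.
fn unify(a: &Ty, b: &Ty, s: &mut Subst) -> Result<(), String> {
    match (resolve(a, s), resolve(b, s)) {
        (Ty::Int, Ty::Int) => Ok(()),
        (Ty::Var(v), t) | (t, Ty::Var(v)) => {
            if t != Ty::Var(v) {
                s.insert(v, t);
            }
            Ok(())
        }
        (Ty::Arrow(a1, a2), Ty::Arrow(b1, b2)) => {
            unify(&a1, &b1, s)?;
            unify(&a2, &b2, s)
        }
        (x, y) => Err(format!("cannot unify {x:?} with {y:?}")),
    }
}

fn main() {
    // Unify 'a -> Int with Int -> 'b, yielding 'a = Int and 'b = Int.
    let lhs = Ty::Arrow(Box::new(Ty::Var(0)), Box::new(Ty::Int));
    let rhs = Ty::Arrow(Box::new(Ty::Int), Box::new(Ty::Var(1)));
    let mut s = Subst::new();
    unify(&lhs, &rhs, &mut s).unwrap();
    println!("{s:?}"); // {0: Int, 1: Int}
}
```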

In fact, I am of the opinion that inadequate computing hardware, coupled with severe memory bandwidth and capacity constraints and the less advanced code optimisers of the day, was a major technical factor contributing to the demise of the Itanium ISA.