
Comment by calaphos

19 hours ago

That is comparing an all-to-all switched NVLink fabric to a 3D torus for TPUs. Those are completely different network topologies with different tradeoffs.

For example, the currently very popular Mixture of Experts architectures require a lot of all-to-all traffic (for expert parallelism), which works a lot better on the switched NVLink fabric, where it doesn't need to traverse multiple links as it does in the torus.
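To put a rough number on "multiple links", here is a minimal sketch that brute-forces the hop count from one node to every other node in a 3D torus. It assumes shortest-path routing with wraparound links and counts hops only; it says nothing about actual TPU or NVLink routing, congestion, or bandwidth. The function names and the 4x4x4 size are made up for illustration.

```python
from itertools import product

def ring_distance(a, b, k):
    """Shortest distance between positions a and b on a ring of k nodes."""
    d = abs(a - b)
    return min(d, k - d)

def torus_hops(src, dst, dims):
    """Minimal hop count between two nodes: sum of per-axis ring distances."""
    return sum(ring_distance(s, d, k) for s, d, k in zip(src, dst, dims))

def torus_hop_stats(dims):
    """Return (mean hops, max hops) from one node to all other nodes.

    A torus looks the same from every node, so a single source is enough
    to describe uniform all-to-all traffic.
    """
    nodes = list(product(*(range(k) for k in dims)))
    src = nodes[0]
    hops = [torus_hops(src, dst, dims) for dst in nodes if dst != src]
    return sum(hops) / len(hops), max(hops)

if __name__ == "__main__":
    mean_hops, diameter = torus_hop_stats((4, 4, 4))
    print(f"4x4x4 torus: mean hops {mean_hops:.2f}, diameter {diameter}")
```

In a single-stage switched fabric every destination is the same distance away (one trip through the switch), whereas in the torus the mean grows with the side length. Whether that matters in practice depends on per-hop cost and link contention, which this sketch does not model.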

This is an underrated point. Comparing just the peak bandwidth is like saying Bulldozer was the far superior CPU of the era because it had a really high frequency ceiling.

Really? Fully connected hardware is unbuildable at scale, which we already know from the HPC world. Fat trees and dragonfly networks are pretty scalable, but a 3D torus is a very good tradeoff, and respects the dimensionality of reality.

Bisection bandwidth is a useful metric, but is hop count? Per-hop cost tends to be pretty small.

  • Latency (of different types), jitter, and guaranteed bandwidth are the real underlying metrics. Hop count is just one potential driver of those, and a given approach may tackle some of these and not others.
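For the bisection bandwidth side of this, a small sketch that counts the links crossing a cut through the middle of one axis of a torus. It assumes one bidirectional link per neighbor pair and an even side length; for a cube this axis-aligned cut is the usual bisection, but in general it is only an upper bound on the minimum cut. Names and sizes are illustrative only.

```python
from itertools import product

def torus_links(dims):
    """Set of undirected links in a torus with the given side lengths."""
    links = set()
    for node in product(*(range(k) for k in dims)):
        for axis, k in enumerate(dims):
            nbr = list(node)
            nbr[axis] = (node[axis] + 1) % k  # +1 neighbor, with wraparound
            nbr = tuple(nbr)
            if nbr != node:  # skip degenerate axes of length 1
                links.add(frozenset((node, nbr)))
    return links

def bisection_links(dims, axis=0):
    """Count links whose endpoints fall on opposite sides of a mid-axis cut."""
    half = dims[axis] // 2
    return sum(
        1
        for link in torus_links(dims)
        if len({n[axis] < half for n in link}) == 2
    )

if __name__ == "__main__":
    for k in (4, 8):
        dims = (k, k, k)
        cut = bisection_links(dims)
        print(f"{k}x{k}x{k}: {cut} links cross the cut, {cut / k**3:.3f} per node")
```

For a k x k x k torus the cut crosses 2k^2 links, so the crossing links per node fall off as 2/k as the machine grows. That is a link count, not a latency or jitter figure, so it complements the metrics above instead of replacing them.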