Comment by mikewarot

2 months ago

Imagine an FPGA big enough to hold the whole model in LUTs (and not RAM), with latches in appropriate places to keep race conditions in check. Even a 100 MHz clock would beat almost anything else in the world running it. Even with 500 stages of pipeline involved, you could still get 200,000 tokens per second for a single stream and have 499 pipeline slots ready for other streams.
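
A quick sanity check of that arithmetic, in Python (the clock rate and pipeline depth are just the figures above, not measured numbers):

```python
# Throughput of a fully unrolled, fully pipelined model, using the
# assumed figures from the comment above.
CLOCK_HZ = 100e6         # assumed 100 MHz FPGA clock
PIPELINE_STAGES = 500    # assumed pipeline depth

# A full pipeline retires one token per clock in aggregate.
aggregate_tokens_per_sec = CLOCK_HZ                 # 100M tokens/s
# A single autoregressive stream must wait for each token to clear
# the whole pipeline before it can feed in the next one.
single_stream = CLOCK_HZ / PIPELINE_STAGES          # 200,000 tokens/s
# Every other pipeline slot can carry an independent stream.
spare_streams = PIPELINE_STAGES - 1                 # 499

print(f"{single_stream:,.0f} tokens/s per stream, {spare_streams} spare slots")
```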

With an FPGA like that, you could translate all of the matrix multiplies and their fixed weights directly into binary logic, optimizing out every multiply or add of a zero bit. That alone could cut the gate count, the computation, and the power consumption roughly in half.
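
To make the zero-bit optimization concrete, here's a minimal Python sketch of the shift-and-add form a constant-weight multiplier synthesizes to (the weight value is made up). Only the set bits of a weight generate adders, so sparse or low-precision weights shrink the logic directly:

```python
# Multiplierless constant multiplication: a fixed weight becomes a sum of
# hardwired shifts, one adder per set bit. Zero bits cost no gates at all.
def constant_multiply_shifts(weight: int) -> list[int]:
    """Shift amounts (one adder each) needed to multiply by `weight`."""
    return [b for b in range(weight.bit_length()) if (weight >> b) & 1]

def multiply_by_constant(x: int, weight: int) -> int:
    # Models the adder tree: each term is a wire-shifted copy of x.
    return sum(x << s for s in constant_multiply_shifts(weight))

w = 0b10010001  # 145: only 3 set bits -> 3 adders instead of 8 partial products
assert multiply_by_constant(7, w) == 7 * w
print(f"weight {w}: {len(constant_multiply_shifts(w))} adders")
```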

Because you wouldn't need to shuttle data to and from RAM, you'd save a huge fraction of the usual latency and eliminate memory bandwidth issues. The effective equivalent memory bandwidth would likely be measured in exabytes per second.
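
A rough back-of-envelope for that bandwidth claim, assuming a hypothetical 70B-parameter model at one byte per weight, with every weight touched once per token at the aggregate rate from the pipeline sketch above (all of these figures are assumptions):

```python
# Effective "bandwidth" if every weight were read once per token.
PARAMS = 70e9            # assumed model size (hypothetical)
BYTES_PER_WEIGHT = 1     # assumed 8-bit weights
TOKENS_PER_SEC = 100e6   # aggregate rate from the pipeline sketch above

effective_bw = PARAMS * BYTES_PER_WEIGHT * TOKENS_PER_SEC
print(f"~{effective_bw / 1e18:.0f} EB/s equivalent")  # ~7 EB/s
```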

This is the type of compute load that would perfectly match a bit-level systolic array.
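
For a concrete picture, here's a toy Python simulation of a weight-stationary linear systolic array computing a dot product, with each PE's multiply decomposed into the AND/shift logic described above; the array size and values are illustrative:

```python
# Toy 1-D systolic array: PE i holds weight w[i]; partial sums march one
# PE per clock, and input skew delivers x[i] to PE i on clock i.
def bit_level_mac(acc: int, w: int, x: int) -> int:
    """Multiply-accumulate built only from per-bit tests, shifts, and adds."""
    for b in range(x.bit_length()):
        if (x >> b) & 1:           # a zero bit contributes no logic
            acc += w << b
    return acc

def systolic_dot(weights: list[int], xs: list[int]) -> int:
    n = len(weights)
    s = [0] * (n + 1)              # partial-sum pipeline registers
    for t in range(n):             # one clock per iteration
        for i in reversed(range(n)):    # reversed = register semantics
            x = xs[i] if t == i else 0  # skewed input delivery
            s[i + 1] = s[i] + bit_level_mac(0, weights[i], x)
    return s[n]                    # result emerges after n clocks

assert systolic_dot([1, 2, 3], [4, 5, 6]) == 32  # 1*4 + 2*5 + 3*6
```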

You'd need an insanely big FPGA for this.

  • Thanks to gigabit SERDES links, it should be reasonably easy to send the activation vectors between chips if you need to distribute the work to fit the available FPGA hardware (see the sketch after this list).

    Note this could also be done if you're just emulating a systolic array on cheap hardware, like Raspberry Pi Picos, using the built-in PIO blocks to handle the much lower signal rates.
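
A toy sketch of that distribution scheme: each chip keeps a slice of the weights resident, and only the activation vector crosses the serial link. The chip count, framing, and sizes here are all made up for illustration:

```python
# Partition a matrix-vector product across chips; only activations travel.
import struct

def partition_rows(matrix, n_chips):
    """Each chip gets a contiguous slice of output rows; weights stay put."""
    per = len(matrix) // n_chips
    return [matrix[i * per:(i + 1) * per] for i in range(n_chips)]

def frame_vector(vec):
    """Hypothetical link framing: a length header plus int32 payload."""
    return struct.pack(f"<I{len(vec)}i", len(vec), *vec)

def unframe_vector(payload):
    (n,) = struct.unpack_from("<I", payload)
    return list(struct.unpack_from(f"<{n}i", payload, 4))

def chip_compute(weight_slice, x):
    """What each chip does locally: its slice of the matrix-vector product."""
    return [sum(w * a for w, a in zip(row, x)) for row in weight_slice]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [10, 20]
wire = frame_vector(x)  # the only traffic on the SERDES (or PIO) link
y = [v for s in partition_rows(W, 2) for v in chip_compute(s, unframe_vector(wire))]
assert y == [50, 110, 170, 230]
```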