Comment by Traster
1 day ago
Training is taking an enormous problem, breaking it into lots of pieces, and managing the data dependencies between those pieces. It's solving one really hard problem. Inference is the opposite: it's lots of small independent problems. All of this "we have X many widgets connected to Y many high-bandwidth optical telescopes" is a training problem that they need to solve. Inference is "I have 20 tokens and I want to throw them at these 5,000,000 matrix multiplies, oh, and I don't care about latency".
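A minimal sketch of the dependency structure the comment describes, with made-up stand-in arithmetic rather than real backprop: training is one sequential chain (step t needs the weights from step t-1), while inference requests only read frozen weights and can run independently.

```python
from concurrent.futures import ThreadPoolExecutor

def train(weights, batches):
    # Training: a sequential chain -- each step's update depends on the
    # weights produced by the previous step, so steps can't run independently.
    for batch in batches:
        grad = sum(batch) * 0.01      # stand-in for a gradient computation
        weights = weights - grad      # data dependency on the prior step
    return weights

def infer(weights, request):
    # Inference: each request only reads the frozen weights,
    # so requests are embarrassingly parallel.
    return weights * sum(request)

weights = train(1.0, batches=[[1, 2], [3, 4]])

# Independent requests can be fanned out with no coordination between them.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda r: infer(weights, r), [[1], [2], [3]]))
```

The point of the sketch: `train` cannot be parallelized across its loop iterations without solving the dependency problem, while `infer` parallelizes trivially across requests.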
I can't think of any case where inference doesn't care about latency.
I can't think of any reason training isn't going to become real-time with a significant CPU budget.