Comment by svantana
3 months ago
> Besides that, AI training (aka gradient descent) is not really an "embarrassingly parallel" problem. At some point, there are diminishing returns on adding more GPUs, even though a lot of effort is going into making it as parallel as possible.
What? It definitely is.
Data parallelism, model parallelism, parameter servers feeding workers, MoE experts that can themselves be split across devices, etc.
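To make the data-parallelism point concrete, here is a toy sketch (not any framework's API): each simulated worker holds an equal shard of the batch, computes a local gradient, and the gradients are averaged, which is the same all-reduce pattern DDP-style training uses. The linear-regression setup is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=64)
w = np.zeros(4)

def local_grad(Xs, ys, w):
    # Mean-squared-error gradient on one worker's shard.
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

# Split the batch across 4 simulated workers.
shards = zip(np.array_split(X, 4), np.array_split(y, 4))
grads = [local_grad(Xs, ys, w) for Xs, ys in shards]
avg_grad = np.mean(grads, axis=0)  # the "all-reduce" step

# With equal shard sizes, the averaged gradient is identical to the
# full-batch gradient, so every worker takes the same update step.
full_grad = local_grad(X, y, w)
assert np.allclose(avg_grad, full_grad)
```

The synchronization cost of that averaging step is what eventually gives diminishing returns, but the gradient computation itself splits cleanly.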
But even if it weren't, you can simply parallelize entire training runs with slight variations in hyperparameters. That is what the article is describing.
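A minimal sketch of that kind of sweep, with a stand-in `train` function in place of a real training job (threads here for brevity; a real sweep would use separate processes, GPUs, or machines):

```python
from concurrent.futures import ThreadPoolExecutor

def train(lr):
    # Hypothetical "training run": pretend validation loss with a
    # minimum near lr = 0.1. Stand-in for a full training job.
    return (lr - 0.1) ** 2

learning_rates = [0.001, 0.01, 0.1, 0.5, 1.0]

# Each run is fully independent -- no communication between them --
# which is exactly what "embarrassingly parallel" means.
with ThreadPoolExecutor() as pool:
    losses = list(pool.map(train, learning_rates))

best_lr = learning_rates[losses.index(min(losses))]
print(best_lr)  # 0.1
```

Since the runs never talk to each other, this scales linearly with the number of workers, with no diminishing returns from synchronization.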