Comment by svantana
3 months ago
> Besides that, AI training (aka gradient descent) is not really an "embarrassingly parallel" problem. At some point, there are diminishing returns on adding more GPUs, even though a lot of effort is going into making it as parallel as possible.
What? It definitely is.
Data parallelism, model parallelism, parameter servers feeding workers, MoE experts that can themselves be split across devices, etc.
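To make the data-parallelism point concrete, here is a toy sketch (not any framework's API): each simulated worker holds an equal shard of the batch, computes a local gradient, and the gradients are averaged, which is the same all-reduce pattern DDP-style training uses. The linear-regression setup is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=64)
w = np.zeros(4)

def local_grad(Xs, ys, w):
    # Mean-squared-error gradient on one worker's shard.
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

# Split the batch across 4 simulated workers.
shards = zip(np.array_split(X, 4), np.array_split(y, 4))
grads = [local_grad(Xs, ys, w) for Xs, ys in shards]
avg_grad = np.mean(grads, axis=0)  # the "all-reduce" step

# With equal shard sizes, the averaged gradient is identical to the
# full-batch gradient, so every worker takes the same update step.
full_grad = local_grad(X, y, w)
assert np.allclose(avg_grad, full_grad)
```

The synchronization cost of that averaging step is what eventually gives diminishing returns, but the gradient computation itself splits cleanly.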
But even if it weren't, you can simply parallelize entire training runs with slight variations in hyperparameters. That is what the article is describing.
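A minimal sketch of that kind of sweep, with a stand-in `train` function in place of a real training job (threads here for brevity; a real sweep would use separate processes, GPUs, or machines):

```python
from concurrent.futures import ThreadPoolExecutor

def train(lr):
    # Hypothetical "training run": pretend validation loss with a
    # minimum near lr = 0.1. Stand-in for a full training job.
    return (lr - 0.1) ** 2

learning_rates = [0.001, 0.01, 0.1, 0.5, 1.0]

# Each run is fully independent -- no communication between them --
# which is exactly what "embarrassingly parallel" means.
with ThreadPoolExecutor() as pool:
    losses = list(pool.map(train, learning_rates))

best_lr = learning_rates[losses.index(min(losses))]
print(best_lr)  # 0.1
```

Since the runs never talk to each other, this scales linearly with the number of workers, with no diminishing returns from synchronization.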