Comment by janalsncm
3 months ago
What? It definitely is.
Data parallelism, model parallelism, a parameter server fanning work out to workers, MoE experts that can themselves be split across devices, etc.
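As a toy illustration of the first of those, here is a minimal data-parallel training step in JAX: each device computes gradients on its own shard of the batch, and `pmean` averages them across devices. The model, shapes, and learning rate are all made up for the example.

```python
import functools
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()

# Toy linear model; names and shapes are illustrative only.
def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

# Data parallelism: each device gets a batch shard, computes local gradients,
# then gradients are averaged across devices before the update.
@functools.partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    grads = jax.lax.pmean(grads, axis_name="devices")
    return jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)

# Replicate params on every device; shard the batch along the leading axis.
params = {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}
params = jax.device_put_replicated(params, jax.local_devices())
x = jnp.ones((n_dev, 32, 8))   # [device, per-device batch, features]
y = jnp.ones((n_dev, 32, 1))
params = train_step(params, x, y)
```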
But even if it weren't, you can simply parallelize training runs with slight variations in hyperparameters. That is what the article is describing.
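A sketch of that embarrassingly parallel version: independent runs with different hyperparameters, fanned out to worker processes (or separate machines in practice). `train_one` is a hypothetical stand-in for whatever training job the article has in mind.

```python
import itertools
from multiprocessing import Pool

# Hypothetical stand-in for one full training run with a given config.
def train_one(config):
    lr, batch_size = config
    # ... run training with these hyperparameters, evaluate on validation ...
    return {"lr": lr, "batch_size": batch_size, "val_loss": 0.0}  # placeholder

if __name__ == "__main__":
    # Each run is independent, so the whole grid can execute in parallel.
    grid = list(itertools.product([1e-3, 3e-4, 1e-4], [64, 128]))
    with Pool(processes=len(grid)) as pool:
        results = pool.map(train_one, grid)
    best = min(results, key=lambda r: r["val_loss"])
    print(best)
```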