Comment by eachro

4 days ago

If you wanted to train it from scratch, how long would it take on a reasonable GPU setup?

For the sake of comparison, you can train a 124M model on a 3090 (see nanoGPT). In that case, each batch ends up being about 500,000 tokens and takes roughly 10 seconds to run forward and backward. At that rate, the 6 trillion tokens this model was trained on would take about 4 years. Or just "too long" for a shorter answer.
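If you want to check the arithmetic yourself, here's the back-of-envelope version. The per-batch numbers are just the rough nanoGPT-on-a-3090 figures above, not measured benchmarks:

```python
# Back-of-envelope estimate: pretraining time on a single consumer GPU.
# Assumed numbers (rough nanoGPT 124M-on-a-3090 figures from above):
tokens_per_batch = 500_000    # ~0.5M tokens per batch
seconds_per_batch = 10        # forward + backward pass, roughly
total_tokens = 6e12           # 6 trillion training tokens

num_batches = total_tokens / tokens_per_batch      # 12,000,000 batches
total_seconds = num_batches * seconds_per_batch    # 1.2e8 seconds
years = total_seconds / (60 * 60 * 24 * 365)

print(f"{years:.1f} years")   # ~3.8 years, i.e. "about 4 years"
```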

The word "reasonable" is vague, but assuming you mean something that could run in a residential unit, it would take a very long time to train from scratch.

This is part of the rationale for releasing this model: now you don't have to start from scratch, and finetuning is feasible on a wide variety of hardware, including reasonable GPU setups (and smaller).