Comment by rck

4 days ago

For comparison, you can train a 124M-parameter model on a 3090 (see nanoGPT). In that setup, each batch is about 500,000 tokens and takes roughly 10 seconds for the forward and backward pass. At that rate, the 6 trillion tokens this model was trained on would take about 4 years. Or, for a shorter answer, just "too long".
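
Spelling out the arithmetic as a quick sanity check. This is only a back-of-the-envelope sketch: the batch size and step time are the ballpark figures from the comment, not measurements, and the function and variable names are made up for illustration.

```python
# Rough estimate of wall-clock training time on a single GPU.
# All inputs are the approximate figures quoted in the comment.
TOTAL_TOKENS = 6_000_000_000_000   # 6 trillion training tokens
TOKENS_PER_BATCH = 500_000         # ~0.5M tokens per batch (assumed)
SECONDS_PER_BATCH = 10.0           # ~10 s forward + backward (assumed)
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def training_years(total_tokens, tokens_per_batch, seconds_per_batch):
    """Return (number of batches, estimated training time in years)."""
    batches = total_tokens // tokens_per_batch
    seconds = batches * seconds_per_batch
    return batches, seconds / SECONDS_PER_YEAR


batches, years = training_years(TOTAL_TOKENS, TOKENS_PER_BATCH, SECONDS_PER_BATCH)
print(f"{batches:,} batches, about {years:.1f} years")
# prints "12,000,000 batches, about 3.8 years"
```

So 12 million batches at 10 seconds each is 120 million seconds, which rounds to the "about 4 years" figure.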