Comment by nbardy

4 days ago

It’s a bit misleading to say “nothing special,” as they are doing more than just increasing parameter count. Progress has been steady in all the subcomponents of training: data filtering and weighting, sparse attention, optimizers, and various efficiency gains in training compute up and down the stack.

They’re using more compute, a bigger model, and tons of training-quality improvements that get more out of a model of equivalent size.