Comment by voxgen

14 days ago

> Or is Behemoth just going through post-training that takes longer than post-training the distilled versions?

This is likely the main explanation. RL fine-tuning repeatedly alternates between inference (generating and scoring responses) and training on those responses. In inference mode they can parallelize across responses, but each response is still generated one token at a time, so if they're aiming for 10k+-token CoTs like other reasoning models, that's likely 5+ minutes per iteration.
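
As a rough sketch of the loop shape I mean (PPO/GRPO-style; all names and numbers here are illustrative stand-ins, not Meta's actual pipeline):

```python
# Toy skeleton of an RL fine-tuning loop: alternate between an inference phase
# (generate + score responses) and a training phase (update on those responses).
import random

def generate_response(policy, prompt):
    # Autoregressive decoding: one token at a time per response, so a 10k-token
    # CoT at a few tens of tokens/sec takes minutes of wall clock, even with
    # many responses being decoded in parallel.
    return prompt + " <chain-of-thought> <answer>"

def score(reward_model, response):
    return random.random()  # stand-in for a reward model / verifier

def update_policy(policy, responses, rewards):
    return policy  # stand-in for one gradient step on the scored responses

def rl_finetune(policy, reward_model, prompts, iterations, samples_per_prompt=8):
    for _ in range(iterations):
        # Inference phase: parallel across responses, serial within each response.
        batch = random.sample(prompts, k=min(4, len(prompts)))
        responses = [generate_response(policy, p)
                     for p in batch for _ in range(samples_per_prompt)]
        rewards = [score(reward_model, r) for r in responses]
        # Training phase: update the policy on the generated, scored responses.
        policy = update_policy(policy, responses, rewards)
    return policy
```

The point is that per-iteration wall-clock time is bounded by the longest response being decoded, not by how many responses you run in parallel.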

There's also likely an element of strategy involved. We've already seen OpenAI hold back releases to time them to undermine competitors' releases (see o3-mini's release date & pricing vs R1's). Meta probably wants to keep that option open.

> see o3-mini's release date & pricing vs R1's

This backfires, though: if OAI had released o3-mini before DeepSeek-R1, R1 would have been a lot less impactful.