Comment by bastawhiz
13 days ago
I don't really understand how Scout and Maverick are distillations of Behemoth if Behemoth is still training. Maybe I missed or misunderstood this in the post?
Did they distill the in-progress Behemoth and the result was good enough for models of those sizes for them to consider releasing it? Or is Behemoth just going through post-training that takes longer than post-training the distilled versions?
Sorry if this is a naïve question.
My understanding is that they have a base model checkpoint for Behemoth from pre-training.
This base model isn't instruction-tuned, so you can't use it directly as a chatbot the way you would an instruction-tuned model.
However, the base model can be distilled into a smaller model, and that distilled model can then be post-trained (instruction-tuned) and released as a chatbot-ready model.
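Very roughly, that pipeline looks something like the toy sketch below. All the model sizes, names, and data here are placeholders I'm making up for illustration, not Meta's actual code or setup:

```python
# Toy sketch of "distill the base teacher, then post-train only the student".
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000

def tiny_lm(width):
    # Stand-in for a real transformer LM; returns logits over the vocab.
    return nn.Sequential(nn.Embedding(VOCAB, width), nn.Linear(width, VOCAB))

teacher = tiny_lm(512)   # "Behemoth" base checkpoint (frozen, pre-training only)
student = tiny_lm(128)   # "Scout"/"Maverick"-sized student
teacher.eval()

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
tokens = torch.randint(0, VOCAB, (8, 32))  # fake pre-training batch

# --- Stage 1: distillation against the (non-instruct) base teacher ---
for _ in range(10):
    with torch.no_grad():
        t_logits = teacher(tokens)
    s_logits = student(tokens)
    # Match the student's next-token distribution to the teacher's.
    loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    )
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- Stage 2: post-train (instruction-tune) only the student ---
# Ordinary next-token cross-entropy on (prompt, response) data; the teacher
# is untouched, which is why the distilled models can ship first.
sft_tokens = torch.randint(0, VOCAB, (8, 32))
sft_logits = student(sft_tokens[:, :-1])
sft_loss = F.cross_entropy(
    sft_logits.reshape(-1, VOCAB), sft_tokens[:, 1:].reshape(-1)
)
opt.zero_grad()
sft_loss.backward()
opt.step()
```

The point is just that distillation only needs forward passes through the frozen base checkpoint, so it can start as soon as a pre-training checkpoint exists, well before Behemoth's own post-training is done.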
> Or is Behemoth just going through post-training that takes longer than post-training the distilled versions?
This is likely the main explanation. RL fine-tuning repeatedly alternates between an inference phase, where responses are generated and scored, and a training phase on those responses. In inference mode they can parallelize across responses, but each response is still generated one token at a time; that's likely 5+ minutes per iteration if they're aiming for 10k+-token CoTs like other reasoning models.
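A minimal sketch of why that loop is slow, assuming a simplified REINFORCE-style update with a group baseline (everything here is a toy stand-in, not Meta's actual objective or infrastructure):

```python
# Each RL iteration alternates between (1) autoregressive generation of many
# sampled responses and (2) a training step on those responses.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, WIDTH, MAX_NEW, GROUP = 1000, 128, 64, 16
policy = nn.Sequential(nn.Embedding(VOCAB, WIDTH), nn.Linear(WIDTH, VOCAB))
opt = torch.optim.AdamW(policy.parameters(), lr=1e-5)

def reward(seq):
    # Placeholder scorer (a real setup would use a verifier or reward model).
    return float((seq % 7 == 0).float().mean())

for step in range(3):                              # RL iterations
    prompt = torch.randint(0, VOCAB, (GROUP, 8))   # GROUP rollouts in parallel...
    seqs = prompt
    with torch.no_grad():
        for _ in range(MAX_NEW):                   # ...but tokens are sequential
            logits = policy(seqs)[:, -1]
            nxt = torch.multinomial(F.softmax(logits, dim=-1), 1)
            seqs = torch.cat([seqs, nxt], dim=1)

    rewards = torch.tensor([reward(s) for s in seqs])
    advantages = rewards - rewards.mean()          # group-mean baseline

    # Training phase: push up log-probs of tokens in high-reward rollouts.
    logits = policy(seqs[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, seqs[:, 1:, None]).squeeze(-1)
    loss = -(advantages[:, None] * tok_logp).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With long CoTs, the inner generation loop dominates wall-clock time, because you can batch across rollouts but not across the tokens within one rollout.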
There's also likely an element of strategy involved. We've already seen OpenAI hold back releases and time them to undermine competitors' launches (see o3-mini's release date & pricing vs R1's). Meta probably wants to keep that option open.
> see o3-mini's release date & pricing vs R1's
This backfired, though: if OpenAI had released o3-mini before DeepSeek-R1, R1 would have been a lot less impactful.