Comment by andy_xor_andrew

2 months ago

Not really? If you read it, there is no validation, no correctness signal, no verification, none of that. They're just passing in benchmark inputs, collecting the outputs (regardless of their quality), training on those outputs, and then sweeping the decode settings (temperature, top-k) of the resulting model. Their conclusion is that this results in a better model than the original, even when the original gets the same temperature/top-k sweep.
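
For anyone unsure what "sweeping the decode settings" means in practice, here is a rough sketch. It is not the paper's code; `decode_distribution` and `sweep` are names I made up, and the logits are toy numbers. It only shows how temperature and top-k reshape a next-token distribution, and how a sweep is just a grid over those two knobs.

```python
import math
from itertools import product


def decode_distribution(logits, temperature, top_k):
    """Next-token probabilities after temperature scaling and top-k truncation.

    Hypothetical helper for illustration only, not taken from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Keep only the indices of the top_k largest logits.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]

    # Divide by temperature, then apply a numerically stable softmax.
    scaled = {i: logits[i] / temperature for i in keep}
    peak = max(scaled.values())
    weights = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(weights.values())

    # Tokens outside the top_k get probability 0.
    return [weights.get(i, 0.0) / total for i in range(len(logits))]


def sweep(logits, temperatures, top_ks):
    """Evaluate every (temperature, top_k) pair in the grid."""
    return {
        (t, k): decode_distribution(logits, t, k)
        for t, k in product(temperatures, top_ks)
    }


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # made-up values
    for setting, probs in sweep(toy_logits, [0.5, 1.0, 2.0], [2, 3]).items():
        print(setting, [round(p, 3) for p in probs])
```

In a real sweep you would score full generations on the benchmark at each setting rather than look at a single distribution, but the grid structure is the same.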

So no, they are not fine-tuning a general purpose model to produce "valid benchmark code results."

Not only that, they additionally ran an experiment with the training temperature turned way up (2.0) and truncation turned off, such that the majority of SFT examples were incoherent (63% IIRC). Yet the model fine-tuned on these broken examples still improved over baseline.

  • Maybe this vaguely still makes sense in some way, because there is actually some useful signal purely in the model "internalizing" the behavior of its own sampler.

    I don't know enough to say anything more formal, but it feels like exposing the model to its own output might help it "learn" to work with the sampler to get to a goal. I know that this is partly one of the reasons why RL is helpful: aside from shifting the output towards a specific reward (RLVR or RLHF), it's also the only place where things are optimized at an actual "end-to-end sampled sequence of tokens" level instead of at the "next logits" level as in pretraining (which is why the highest-probability suffix completion isn't necessarily just the greedy highest-logit choice at each step).
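
To make the greedy-versus-best-sequence point concrete, here is a toy example. The two-step "model" and its probabilities are invented purely for illustration; the only claim is that picking the most likely token at each step can land on a sequence that is less likely overall than another one.

```python
# Hypothetical toy "model": next-token probabilities keyed by the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}


def sequence_probability(seq):
    """Product of per-step probabilities along the sequence."""
    prob = 1.0
    for i, token in enumerate(seq):
        prob *= TOY_MODEL[tuple(seq[:i])][token]
    return prob


def greedy_decode():
    """Pick the single most likely token at each of the two steps."""
    seq = ()
    for _ in range(2):
        dist = TOY_MODEL[seq]
        seq += (max(dist, key=dist.get),)
    return seq


def best_sequence():
    """Exhaustively score every two-token sequence and return the most likely."""
    candidates = [
        (first, second)
        for first in TOY_MODEL[()]
        for second in TOY_MODEL[(first,)]
    ]
    return max(candidates, key=sequence_probability)


if __name__ == "__main__":
    for name, seq in [("greedy", greedy_decode()), ("best", best_sequence())]:
        print(name, seq, round(sequence_probability(seq), 2))
```

Greedy commits to "A" because 0.6 beats 0.4, but every continuation of "A" is mediocre, so the whole sequence tops out around 0.33. Starting with the less likely "B" leads to a near-certain "x", for roughly 0.36 overall.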

They are training the model to 1. Produce code (as opposed to answer a question, write a poem, etc.) 2. Produce long enough output to be a valid solution. So they are doing exactly what I said. Cheers.