Comment by HarHarVeryFunny

2 months ago

Apparently a key part of this is not just sampling with a combination of high temperature (to boost diversity at forks) and top-k (to truncate unwanted diversity at lock positions), but using those settings to first generate a fine-tuning dataset and then training on it. The fine-tuning lets the model adapt its weights to the new skewed distribution, which sounds a bit like an annealing process.
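
To make the sampling half of that concrete, here is a rough sketch of temperature plus top-k applied to one "lock" position and one "fork" position. The logits, the temperature of 1.5, k=3, and the function names are all made up for illustration; nothing here is taken from the paper or from any particular library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_top_k(logits, temperature, k):
    """Apply temperature, keep the k most likely tokens, renormalize."""
    probs = softmax(logits, temperature)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def sample(dist, rng=random):
    """Draw one token index from a {token: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

# Hypothetical logits, chosen only for illustration.
lock_logits = [8.0, 1.0, 0.5, 0.0, -0.5, -1.0]  # one clearly correct token
fork_logits = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0]    # several plausible tokens

lock = temperature_top_k(lock_logits, temperature=1.5, k=3)
fork = temperature_top_k(fork_logits, temperature=1.5, k=3)

print("lock:", {t: round(p, 3) for t, p in lock.items()})
print("fork:", {t: round(p, 3) for t, p in fork.items()})
print("sampled fork token:", sample(fork))
```

With these numbers the lock position still puts almost all of its mass on token 0 even at high temperature, while the fork position spreads its mass fairly evenly over the three surviving tokens. The second half of the recipe, collecting many such samples into a dataset and fine-tuning on them, is not shown.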

It does raise some questions:

1) Is this always a win for coding? The top-k truncation is also going to limit "fork" diversity. Maybe there is a better way to reshape the output probability distribution that sharpens the cutoff where it is already sharp (locks), without affecting it so much where it is more gradual (forks)?

2) Wouldn't this also benefit generation in non-coding domains, which generally also contain both "fork" and "lock" positions?
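
For question 1, a toy comparison shows the concern. The sketch below contrasts a fixed top-k cut with a cutoff defined relative to the top token's probability (similar in spirit to min-p sampling). This is only one possible candidate, not something the source proposes, and the logits, the 0.3 ratio, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Indices of the k most likely tokens, regardless of distribution shape."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return sorted(keep)

def relative_cutoff_filter(probs, ratio):
    """Indices of tokens with probability >= ratio * the top probability."""
    threshold = ratio * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

# Hypothetical distributions, chosen only for illustration.
lock_probs = softmax([8.0, 1.0, 0.5, 0.0, -0.5, -1.0], temperature=1.5)
fork_probs = softmax([2.0, 1.8, 1.6, 1.4, 1.2, 1.0], temperature=1.5)

for name, probs in [("lock", lock_probs), ("fork", fork_probs)]:
    print(name,
          "top-k:", top_k_filter(probs, 3),
          "relative:", relative_cutoff_filter(probs, 0.3))
```

Here top-k with k=3 keeps three tokens at both positions, so it discards half of the plausible fork options while still admitting two unlikely tokens at the lock. The relative cutoff keeps only the top token at the lock and all six tokens at the fork. Whether that actually helps after fine-tuning is an open question this toy example cannot answer.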