Comment by NitpickLawyer

1 hour ago

Unless I'm misreading the premise, the downstream evals seem to show that 1T-1S beats 4T-4S or 8T-8S on a bunch of tasks. Doesn't that invalidate the whole population mix idea? (Also, the part about LoRAs being "evolved" by changing stuff in a few seconds was a bit confusing to me; perhaps I misunderstood something.)

AMavorParker 15 minutes ago

Thanks for your interest!

Not necessarily. While the held-out downstream evals showed that 1T-1S setups outperformed larger populations like 4T-4S or 8T-8S on some specific benchmarks, that does not invalidate the motivation for population-based training.

The main motivation for larger populations is more diversity in both problems and solutions, which can encourage specialization and broader task coverage. Even if that diversity does not improve results on some of the particular benchmarks we used, it is still arguably a desirable property.

Figure 9 in the paper, for example, shows that students trained with larger populations are exposed to a much wider range of tasks than the baseline.

Also, averaged across all the benchmarks we measure, the 4v4 (4T-4S) setup performs best.

The “creating new population members in seconds” comment refers to operating in LoRA space. The mutation and crossover operators are applied to lightweight LoRA adapters rather than full model weights, making the process very fast and memory efficient.
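As a rough illustration of why this is cheap, here is a minimal sketch of mutation and crossover over adapter weights. It is not the paper's actual implementation: the flat-list adapter representation, the Gaussian-noise mutation, the per-layer uniform crossover, and the names `mutate`, `crossover`, `q_proj`, and `v_proj` are all assumptions made for the example.

```python
import random

# Hypothetical representation: an adapter maps a layer name to a flat list of
# LoRA weights. Real adapters hold low-rank A/B matrices per layer.
Adapter = dict[str, list[float]]


def mutate(parent: Adapter, sigma: float, rng: random.Random) -> Adapter:
    """Return a copy of `parent` with Gaussian noise added to every weight."""
    return {
        name: [w + rng.gauss(0.0, sigma) for w in weights]
        for name, weights in parent.items()
    }


def crossover(a: Adapter, b: Adapter, rng: random.Random) -> Adapter:
    """Build a child by taking each layer's weights from one parent at random."""
    return {
        name: list(a[name] if rng.random() < 0.5 else b[name])
        for name in a
    }


rng = random.Random(0)
parent_a = {"q_proj": [0.1, -0.2, 0.3], "v_proj": [0.0, 0.5, -0.1]}
parent_b = {"q_proj": [0.4, 0.1, -0.3], "v_proj": [0.2, -0.2, 0.6]}

# A new population member: recombine two parents, then perturb slightly.
child = mutate(crossover(parent_a, parent_b, rng), sigma=0.01, rng=rng)
```

Both operators only touch the adapter's parameters, which are a small fraction of the full model's, so producing a child costs little more than copying the adapter. The base model weights stay frozen and shared across the population.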