PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

2 hours ago (vmax.ai)

3 comments

AMavorParker

Unless I'm wrong about the premise, the downstream results seem to show that 1T-1S is better than 4 or 8T-8S on a bunch of tasks. Doesn't that invalidate the whole population mix thing? (Also, the part about LoRAs being "evolved" by changing stuff in a few seconds was a bit confusing to me; perhaps I misunderstood something.)

AMavorParker 2 hours ago

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.
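To make the "mutations and crossovers that produce same-rank population members in seconds" phrase concrete, here is a minimal sketch of what weight-space operators on LoRA factors could look like. It is an illustration under assumptions, not the paper's operators, which the abstract does not specify. It assumes each adapter is a pair of small matrices with update B @ A, a Gaussian-noise mutation, and a crossover that copies whole rank components from one parent or the other; `make_adapter`, `mutate`, `crossover`, and `shape` are hypothetical names. Because the operators only touch the low-rank factors and need no gradient steps, they are cheap, which is one plausible reading of "in seconds".

```python
import random

def make_adapter(d_out, d_in, rank, rng):
    """Hypothetical LoRA adapter: the weight update is B @ A (nested lists)."""
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]   # rank x d_in
    B = [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(d_out)]  # d_out x rank
    return {"A": A, "B": B}

def mutate(adapter, sigma, rng):
    """Assumed mutation: add Gaussian noise to every entry; shapes are unchanged."""
    return {name: [[w + rng.gauss(0.0, sigma) for w in row] for row in mat]
            for name, mat in adapter.items()}

def crossover(p1, p2, rng):
    """Assumed crossover: each rank component k (row k of A together with
    column k of B) is copied from one randomly chosen parent, so the child
    has the same rank as its parents."""
    rank = len(p1["A"])
    picks = [rng.choice((p1, p2)) for _ in range(rank)]
    A = [list(picks[k]["A"][k]) for k in range(rank)]
    B = [[picks[k]["B"][i][k] for k in range(rank)] for i in range(len(p1["B"]))]
    return {"A": A, "B": B}

def shape(mat):
    return (len(mat), len(mat[0]))

rng = random.Random(0)
parent_a = make_adapter(d_out=8, d_in=16, rank=4, rng=rng)
parent_b = make_adapter(d_out=8, d_in=16, rank=4, rng=rng)
child = mutate(crossover(parent_a, parent_b, rng), sigma=0.01, rng=rng)
print(shape(child["A"]), shape(child["B"]))  # (4, 16) (8, 4)
```

Which adapters get replaced, and by which children, would be decided by a separate selection step that this sketch leaves out.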

tpoacher 1 hour ago

Minor criticism: I haven't had a chance to read it properly yet, but for a method that purports to be an evolutionary algorithm, it's missing all the formal language of the field. There's zero mention of a fitness function (let alone internal/external co-evolution ones) or a selection operator.

So my first impression is that either this is a non-evolutionary algorithm masquerading as one and diluting concepts like mutation and crossover that have well defined meanings, or it is one but you're abusing terminology from other fields (like RL and "rewards") instead. Either way it's a confusing first impression, and one gets the subtle vibe that word choices are more there to create a "buzz" than to create clarity.

(not trying to be dismissive, I genuinely hope this is useful feedback)

Paper does look interesting, I'll try to read properly when I have time.