← Back to context

Comment by adamkinney

8 days ago

Right! If memorizing the upstream fix counts against the model, you're measuring how stale your benchmark is, not what the model can do.

The fix is only score on issues newer than the training cutoff, and rebuild the set every cycle. "Harden the prompt so it won't read git history" is testing instruction-following. Legitimate thing to measure, but it's a different than "can it fix the bug."

Reporting one number that blends the two is what makes the headline meaningless.