Comment by mayerwin

5 hours ago

Yes, that is definitely a limitation. Elo ratings only measure models relative to one another, so if all models get worse at the same pace, the ratings won't show any degradation. I couldn't find any historical dataset of model benchmark scores (I would have loved one, to see how performance holds up over time compared with the initial announcement), so the Elo data from Arena AI was the least imperfect proxy I could find.
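To make the limitation concrete, here is a minimal sketch assuming the standard Elo expected-score formula (I'm not claiming this is exactly how Arena AI fits its ratings). The predicted head-to-head win probability depends only on the difference between two ratings, so shifting every model by the same amount leaves it unchanged. The ratings and the shift below are hypothetical illustrative values.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B; depends only on r_a - r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Hypothetical ratings for two models.
model_a, model_b = 1300, 1250

# Hypothetical uniform drop applied to every model at once.
shift = -80

before = expected_score(model_a, model_b)
after = expected_score(model_a + shift, model_b + shift)

# Identical predictions: pairwise votes cannot reveal a uniform decline.
print(before == after)
```

Since pairwise votes are the only input, a decline that hits every model equally produces the same vote patterns and therefore the same ratings.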