Comment by jimkleiber

10 hours ago

How well does such LLM research hold up as new models are released?

Most LLM research decays because the evaluation harness isn't treated as a stable artefact. If you freeze the tasks, the acceptance criteria, and the measurement method, you can swap models in and still compare apples to apples. Without that, each new release forces a reset, and people mistake novelty for progress.
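
For concreteness, here is a minimal sketch of what "freezing the harness" means. All names here (`run_eval`, `TASKS`, `call_model`) are hypothetical, not from any real library: the tasks and acceptance checks are fixed artefacts, and only the model callable is swapped between runs.

```python
# A minimal sketch of a frozen evaluation harness (hypothetical names
# throughout). The task list, acceptance checks, and scoring rule are
# fixed; only the model callable changes between runs.

from typing import Callable

# Frozen task set: (prompt, acceptance check) pairs. Never edited
# between model releases, so scores stay comparable over time.
TASKS: list[tuple[str, Callable[[str], bool]]] = [
    ("What is 17 * 23?", lambda out: "391" in out),
    ("Name the capital of France.", lambda out: "Paris" in out),
]

def run_eval(model: Callable[[str], str]) -> float:
    """Score a model against the frozen tasks; return the pass rate."""
    passed = sum(check(model(prompt)) for prompt, check in TASKS)
    return passed / len(TASKS)

# Swapping models is just swapping the callable; the harness, tasks,
# and measurement method stay identical (call_model is a placeholder):
# score_v1 = run_eval(lambda p: call_model("model-v1", p))
# score_v2 = run_eval(lambda p: call_model("model-v2", p))
```

The point of the design is that the only free variable across releases is the model itself, so a score difference reflects the model, not a drifting benchmark.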