
Comment by NitpickLawyer

20 days ago

The reported tables also don't match the screenshots. And their baselines and tests are too close to tell apart (judging by the screenshots, not the tables): 29/33 baseline, 31/33 skills, 32/33 skills + "use skill" prompt, 33/33 agent.md

Good catch on the numbers. 29/33 vs 33/33 is the kind of gap that could easily be noise with that sample size. You'd need hundreds of runs to draw any meaningful conclusion about a four-task gap on a 33-task suite, especially given how non-deterministic these models are.
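
To put a rough number on it, here's a quick sanity check. It assumes those screenshot figures are independent pass/fail counts over 33 tasks per condition (the article doesn't say how tasks were sampled or whether runs were repeated), and the "true" pass rates in the second half are made up purely to illustrate how underpowered n=33 is:

```python
# Rough sanity check on the 29/33 vs 33/33 comparison, assuming
# independent pass/fail counts over 33 tasks per condition.
import numpy as np
from scipy.stats import fisher_exact

baseline = (29, 4)   # (passes, failures): 29/33 baseline
agent_md = (33, 0)   # 33/33 agent.md

_, p = fisher_exact([baseline, agent_md])
print(f"Fisher exact p = {p:.2f}")  # ~0.1, not significant at 0.05

# How often would n=33 per arm detect a real gap of this size?
# The true pass rates below are hypothetical, for illustration only.
rng = np.random.default_rng(0)
p_base, p_agent, n, trials = 0.88, 0.99, 33, 2000
hits = 0
for _ in range(trials):
    b = rng.binomial(n, p_base)   # simulated baseline passes
    a = rng.binomial(n, p_agent)  # simulated agent.md passes
    _, pv = fisher_exact([(b, n - b), (a, n - a)])
    hits += pv < 0.05
print(f"Estimated power at n=33: {hits / trials:.0%}")  # low
```

On the observed counts the two-sided p-value comes out around 0.1, and the simulated power at n=33 is low, which is the "could easily be noise" point in numbers.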

This is a recurring problem with LLM benchmarking — small sample sizes presented with high confidence. The underlying finding (always-in-context > lazy-loaded) is probably directionally correct, but the specific numbers don't really support the strength of the claims in the article.