Comment by easygenes

10 hours ago

Unless you're looking at something like a pass@100 benchmark, the benchmarks are confounded heavily by a likelihood of a "golden path" retrieval within their capabilities. This is on top of uncertainties like how well your task within a domain maps to the relevant test sets, as well as factors like context fullness and context complexity (heavy list of relevant complex instructions can weigh on capabilities in different ways than e.g. having a history where there's prior unrelated tasks still in context).

The best tests are your own custom personal-task-relevant standardized tests (which the best models can't saturate, so aiming for less than 70% pass rate in the best case).

All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.