Comment by epistasis

17 hours ago

IMHO the benchmarks aren't useful, and ranking among the frontier models is mostly noise. The extra features around the coding agent have a much bigger impact on productivity than having to provide slightly more specification and guidance to the models; a 90% success rate versus a 92% success rate on the tasks I ask it to do is far more influenced by what I say than what the model is capable of.

0 comments