Comment by codeinred

4 hours ago

We're at the point where LLMs and coding agents are supposed to do higher-level work. It makes sense to benchmark them against top human performance rather than average human performance, because on specialized tasks, average human performance isn't enough.

The issues you described actually seem like strengths of the benchmark.