Comment by gertlabs

2 hours ago

If you are referring to the parent post, then yes, it's hard to draw conclusions from such a small sample size.

For our testing, we use hundreds of different environments across disciplines, and the results seem to line up with subjective experience better than other benchmarks do. The environments cover coding, agentic coding, and non-coding reasoning.