Comment by coldtea

4 days ago

>I don't see this as evidence that Opus 4.6 has gotten worse.

I see it as corroborating evidence for actual everyday experience.

Also, any reason to imply "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?

1 comment


Reubend 3 days ago

> Also, any reason to imply "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?

They didn't say how many runs they did, didn't show any numbers for variance across runs, and so on.

So while they may have done that behind the scenes and just not told us, this doesn't seem like a rigorous analysis to me. It seems to me like people just want to find data that support the conclusion they already decided on (which is that Opus got worse).
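To make concrete what I mean by sample size and variance, here is a rough Python sketch of the check I'd want to see reported. The function names and scores are made up for illustration (they are not BridgeBench numbers), and the "two combined standard errors" threshold is just a rule of thumb I'm assuming here, not a proper significance test.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (n, mean, sample stdev, standard error) for repeated runs."""
    n = len(scores)
    if n < 2:
        # A single run gives no information about run-to-run variance.
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    sem = stdev / math.sqrt(n)        # standard error of the mean
    return n, mean, stdev, sem


def gap_is_within_noise(before, after, k=2.0):
    """True if the difference in means is under k combined standard errors."""
    _, mean_b, _, sem_b = summarize_runs(before)
    _, mean_a, _, sem_a = summarize_runs(after)
    combined = math.sqrt(sem_b ** 2 + sem_a ** 2)
    return abs(mean_b - mean_a) < k * combined


# Hypothetical scores, invented purely for illustration.
earlier = [71.0, 68.5, 73.2, 69.8, 70.5]
later = [69.0, 72.1, 67.4, 70.9, 68.6]

print(summarize_runs(earlier))
print(gap_is_within_noise(earlier, later))
```

With numbers like these, a one-point drop in the average is smaller than the run-to-run spread, so it wouldn't tell you much on its own. Without the run count and variance, there's no way to do even this basic check on a published score.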