Comment by Reubend
4 days ago
> Also, any reason to imply "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?
They didn't report how many runs they did, and they didn't show any numbers for variance across runs.
So while they may have done that behind the scenes and just not told us, this doesn't seem like a rigorous analysis to me. It seems to me like people just want to find data that support the conclusion they already decided on (which is that Opus got worse).
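To make concrete what reporting sample size and variance would look like, here is a minimal Python sketch. It assumes you have per-run scores for two snapshots of a model. The function names, the `k=2.0` threshold, and every score in the usage example are hypothetical and are not BridgeBench data. It is a rough two-standard-error check, not a formal significance test.

```python
import math
import statistics


def summarize_runs(scores):
    """Return run count, mean, sample stdev, and standard error of the mean."""
    n = len(scores)
    mean = statistics.mean(scores)
    # Sample standard deviation is undefined for a single run.
    sd = statistics.stdev(scores) if n > 1 else float("nan")
    sem = sd / math.sqrt(n) if n > 1 else float("nan")
    return {"n": n, "mean": mean, "stdev": sd, "sem": sem}


def difference_exceeds_noise(before, after, k=2.0):
    """Rough check: is the gap in means larger than k standard errors?

    k=2.0 is an assumed rule-of-thumb threshold, not a formal test.
    With only one run per side the standard error is NaN, so this
    returns False: a single run cannot separate signal from noise.
    """
    a = summarize_runs(before)
    b = summarize_runs(after)
    diff = b["mean"] - a["mean"]
    se_diff = math.sqrt(a["sem"] ** 2 + b["sem"] ** 2)
    return abs(diff) > k * se_diff


if __name__ == "__main__":
    # Hypothetical scores, invented purely for illustration.
    earlier = [80, 82, 78, 81, 79]
    later = [79, 81, 80, 78, 82]
    print(summarize_runs(earlier))
    print(summarize_runs(later))
    print(difference_exceeds_noise(earlier, later))
```

With one run on each side, the check cannot say anything either way, which is the problem with a result published without a run count or variance.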