Comment by bsder

4 days ago

> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

And how is that an excuse?

I don't care about how good a model could be. I care about how good a model was on my run.

Consequently, my opinion of a model is going to be based on its worst performance, not its best.

As such, this qualifies as strong evidence that Opus 4.6 has gotten worse.

>> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

> And how is that an excuse? […] this qualifies as strong evidence…

This qualifies as nothing, precisely because of how random processes work; that's what the GP is saying. The numbers aren't reliable if they come from just one run.

If this is counter-intuitive, a refresher on basic statistics and probability theory may be in order.
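To make that concrete, here's a minimal sketch of the statistics involved. All the numbers (means, noise level, the models themselves) are invented for illustration, not measured from any real benchmark:

```python
import random

random.seed(0)

# Hypothetical benchmark scores: model A is genuinely better on average,
# but individual runs are noisy. All numbers invented for illustration.
MEAN_A, MEAN_B, NOISE = 72.0, 68.0, 8.0

def one_run(mean):
    """One nondeterministic benchmark run: true mean plus Gaussian noise."""
    return random.gauss(mean, NOISE)

TRIALS = 10_000
b_wins = sum(one_run(MEAN_B) > one_run(MEAN_A) for _ in range(TRIALS))
print(f"Single runs where the worse model 'wins': {b_wins / TRIALS:.1%}")
# With these numbers, roughly a third of single-run comparisons rank the
# models backwards; only averaging over many runs gives a reliable answer.
```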

  • > If this is counter-intuitive, a refresher on basic statistics and probability theory may be in order.

    I'm not running "statistics"; I'm doing an individual run. I care about the quality of my particular run, not the general quality of the "aggregate".

    The problem is that the difference may not be immediately observable. Sure, if the model gives a wrong answer, that's caught quickly. If it costs me 10x the time, that isn't immediately caught, but it's no less of a problem.

No, what they're saying is that the previous run could simply have been lucky and not representative!
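Here's a minimal sketch of that point (again with invented numbers): even when a model hasn't changed at all, a sizable fraction of before/after run pairs will look like a clear regression just from noise:

```python
import random

random.seed(1)

# The same hypothetical model "before" and "after" an update that changes
# nothing: identical mean and noise. All numbers invented for illustration.
MEAN, NOISE = 70.0, 8.0
TRIALS = 10_000

def one_run():
    return random.gauss(MEAN, NOISE)

# Count before/after pairs where the "new" run scores >= 5 points lower.
drops = sum(one_run() - one_run() >= 5.0 for _ in range(TRIALS))
print(f"Pairs that look like a clear regression: {drops / TRIALS:.1%}")
# Even with zero real change, roughly a third of pairs show what looks
# like a regression: a single lucky or unlucky run proves nothing.
```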