Comment by srdjanr

13 hours ago

It makes sense to me intuitively (though I'm not sure if my reasoning is actually correct).

A worse model may not "know" enough to distinguish between a 70 candidate and a 100 candidate, so it's expected that its output has high variance. But a better model might "know" enough to tell them apart, so it can be more confident and thus more consistent.
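
Here is a toy simulation of that intuition. It is only a sketch, not a claim about how any real model behaves. I'm assuming each grader reports the candidate's true score plus Gaussian noise, and that a "worse" model simply has a larger noise level. The function `simulate_scores` and the values of `noise_sd` are made up for illustration.

```python
import random
import statistics


def simulate_scores(true_quality, noise_sd, n_samples, seed):
    """Hypothetical grader: true quality plus Gaussian noise, clipped to 0..100."""
    rng = random.Random(seed)
    return [
        min(100.0, max(0.0, rng.gauss(true_quality, noise_sd)))
        for _ in range(n_samples)
    ]


# Assumption: the weaker model is much less certain about the same candidate.
weak_scores = simulate_scores(85, noise_sd=15.0, n_samples=1000, seed=0)
strong_scores = simulate_scores(85, noise_sd=3.0, n_samples=1000, seed=0)

# Spread of repeated gradings of the same candidate.
weak_spread = statistics.pstdev(weak_scores)
strong_spread = statistics.pstdev(strong_scores)

print(f"weak model spread:   {weak_spread:.1f}")
print(f"strong model spread: {strong_spread:.1f}")
```

With noise that large, the weak grader's scores for one candidate can cover most of the range between 70 and 100, while the strong grader's scores stay close to the true value.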