Comment by sam_goody
5 hours ago
Except that if you tried one-shotting your ticket twenty times at different hours of the day and different days of the week, you would see enough variation to make benchmarks even if you used the same model every time. Much more so if you fiddled with the thinking or changed the prompt.
Because the models are non-deterministic, because of constant updates and changes, and because they are throttled according to number of users, releases, etc.
You never get "the same" Steph Curry, he might be tired, annoyed by a fan, getting older... but if he and I were to throw 100 3-pointers, we could all correctly guess who will perform better.
Good point.
But I use Codex and Claude daily (work and hobby respectively). And there are days where one or the other just seems to have gotten up on the wrong side of the bed. Or is just being lazy. Or is suddenly super-powered and does everything, including what I asked it not to. (To be fair, the same thing happens with myself. :/)
I am convinced that if I were benchmarking, I would conclude these are different models on different days.
[This conviction may say more about me than about the model.]