Comment by TIPSIO
18 hours ago
I too suspect A/B testing is the culprit: context window limits, system prompts, MAYBE some other questionable things that should be disclosed.
Either way, if true, given the cost I wish I could opt out or that it were at least more transparent.
Put out variants you can select and see which one people flock to. I and many others would probably test constantly and provide detailed feedback.
All speculation, though.
If that's the case, then as a benchmark operator you'd want to run the benchmark through multiple accounts on different machines to average out the A/B-test noise.
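Something like this rough sketch is what I have in mind (`run_benchmark` and the account keys are placeholders for whatever your real harness and credentials are; the point is just that per-account A/B bucketing shows up as spread across accounts instead of silently skewing a single run):

```python
import statistics

# Hypothetical: each entry is an API key tied to a separate account,
# ideally used from a separate machine or network.
ACCOUNTS = ["key-account-1", "key-account-2", "key-account-3"]


def run_benchmark(api_key: str) -> float:
    """Placeholder for your actual benchmark; returns a single score."""
    raise NotImplementedError("wire this to your real benchmark harness")


def averaged_score(accounts: list[str]) -> tuple[float, float]:
    """Run the benchmark once per account and report mean and spread."""
    scores = [run_benchmark(key) for key in accounts]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    mean, spread = averaged_score(ACCOUNTS)
    # A large spread relative to the mean is itself evidence of
    # per-account differences (A/B buckets, throttling, etc.).
    print(f"mean={mean:.3f} stdev={spread:.3f} across {len(ACCOUNTS)} accounts")
```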
Whenever I see new behaviors and suspect I'm being tested on, I'll typically see a feedback form at some point in that session. Well, that, and whenever I drop four-letter words.
I know it’s more random sampling than not. But they are definitely using our codebases (and in some respects our livelihoods) as their guinea pigs.