Comment by TIPSIO
18 hours ago
I too suspect A/B testing is the culprit: context window limits, system prompts, MAYBE some other questionable things that should be disclosed.
Either way, if true, given the cost I wish I could opt out or that it were at least more transparent.
Put out variants you can select and see which one people flock to. I and many others would probably test constantly and provide detailed feedback.
All speculation, though.
If that's the case, then as a benchmark operator you'd want to run the benchmark through multiple accounts on different machines to average out the A/B-test noise.
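Something like this rough sketch is what I have in mind (`run_benchmark` and the account keys are placeholders for whatever your real harness and credentials are; the point is just that per-account A/B bucketing shows up as spread across accounts instead of silently skewing a single run):

```python
import statistics

# Hypothetical: each entry is an API key tied to a separate account,
# ideally used from a separate machine or network.
ACCOUNTS = ["key-account-1", "key-account-2", "key-account-3"]


def run_benchmark(api_key: str) -> float:
    """Placeholder for your actual benchmark; returns a single score."""
    raise NotImplementedError("wire this to your real benchmark harness")


def averaged_score(accounts: list[str]) -> tuple[float, float]:
    """Run the benchmark once per account and report mean and spread."""
    scores = [run_benchmark(key) for key in accounts]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    mean, spread = averaged_score(ACCOUNTS)
    # A large spread relative to the mean is itself evidence of
    # per-account differences (A/B buckets, throttling, etc.).
    print(f"mean={mean:.3f} stdev={spread:.3f} across {len(ACCOUNTS)} accounts")
```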
Whenever I see new behaviors and suspect I'm being tested on, I'll typically see a feedback form at some point in that session. Well, that, and whenever I drop four-letter words.
I know it’s more random sampling than not. But they are definitely using our codebases (and in some respects our livelihoods) as their guinea pigs.