Comment by coffeefirst

8 hours ago

Yeah #2 may be incidental. Suppose one lab focused on bigger models, and another on reinforcement training geared towards factual accuracy over sycophancy. You could easily wind up with a model from the second lab that is less powerful but more accurate.

I can’t prove it but I suspect there’s a bit of that going on.

I think one problem is that models which hallucinate often, succeeding only a few times out of 8 or 16, can still get good results on benchmarks, most of which measure success out of top k. From a benchmark's perspective, you don't really care whether 15 of your 16 generations failed, as long as one succeeded, but as a user you mostly care that the one generation you actually get is the successful one. I think this effect is easier to see on Gemini Flash: it hallucinates like crazy but looks like it's by design to boost benchmarks.
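
To put rough numbers on the gap between "one of 16 succeeded" and "the one you got succeeded", here's a quick sketch. It assumes every generation succeeds independently with the same probability `p`, which is my simplification and not how any particular benchmark is defined; `pass_at_k` is just a name I picked.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples succeeds.

    Assumes each sample succeeds with the same probability p,
    independently of the others (a simplification for illustration).
    """
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Compare what a user sees (k=1) with what a top-k benchmark sees (k=16).
    for p in (0.1, 0.25, 0.5):
        print(
            f"per-try success={p:.2f}  "
            f"pass@1={pass_at_k(p, 1):.2f}  "
            f"pass@16={pass_at_k(p, 16):.2f}"
        )
```

Under those assumptions, a model that is right only 10% of the time per try still clears roughly 81% at k=16, while the user asking once sees the 10%.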

  • > it hallucinates like crazy but looks like it's by design to boost benchmarks.

    Wasn’t there a discussion around some new-ish benchmark _punishing_ hallucinated answers (over not replying at all) recently? Maybe in the not-so-distant future, this “spam replies until one’s correct” strategy won’t be able to game a benchmark much at all anymore.
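
I don't know the actual scoring rule of that benchmark, so this is purely a hypothetical scheme to show why penalizing wrong answers changes the incentive: +1 for a correct answer, 0 for abstaining, and minus `wrong_penalty` for a wrong one. The function names and the numbers are made up for illustration.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering under a hypothetical scoring rule:
    +1 if correct, -wrong_penalty if wrong. Abstaining scores 0."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answering only pays off if it beats the 0 you get for abstaining."""
    return expected_score(p_correct, wrong_penalty) > 0.0


if __name__ == "__main__":
    p = 0.25  # model is right a quarter of the time on this question
    for penalty in (0.0, 0.5, 1.0):
        print(
            f"penalty={penalty:.1f}  "
            f"expected={expected_score(p, penalty):+.3f}  "
            f"answer={should_answer(p, penalty)}"
        )
```

With no penalty, guessing is always worth it; under this made-up rule, guessing only pays when `p_correct` exceeds `wrong_penalty / (1 + wrong_penalty)`, so a model that is usually wrong does better by not replying.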