
Comment by rsynnott

1 day ago

... I mean, when evaluating "45 + 8 minutes" where the expected answer was "63 minutes", as in the article, a competent human reviewer does not go "hmm, yes, that seems plausible, it probably succeeded, give it the points".

I know LLM evangelists love this "humans make mistakes too" line, but, really, only an _exceptionally_ incompetent human evaluator would fall for that one.
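The check in question is trivially mechanical, for what it's worth; a rough sketch of what "grading" should mean here (function name, unit handling, and tolerance are mine, not from the article):

```python
# Rough sketch of a deterministic grading check for the article's example.
# The name, unit stripping, and tolerance are illustrative assumptions.
def grade_numeric(model_answer: str, expected_minutes: float, tol: float = 1e-6) -> bool:
    """Evaluate the model's arithmetic and compare it to the expected value."""
    expr = model_answer.replace("minutes", "").strip()  # "45 + 8 minutes" -> "45 + 8"
    try:
        value = eval(expr, {"__builtins__": {}})  # toy-only; use a real parser in practice
    except Exception:
        return False
    return abs(value - expected_minutes) <= tol

print(grade_numeric("45 + 8 minutes", 63))  # False: 45 + 8 = 53, so no points
```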

Have you ever hired human evaluators at scale? They make all sorts of mistakes. Relatively low probability, so it’s a noise factor, but I have yet to meet the human who is 100% accurate at simple tasks done thousands of times.

  • Which is why you hire them at scale, as you say; then they are very reliable. LLMs at scale are not.

    The problem with these AI models is that there is no point at which you can just scale them up and have them solve problems as accurately as a group of humans. They add too much noise and eventually go haywire when left to their own devices (a back-of-the-envelope comparison below).
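    Back-of-the-envelope: if raters err independently, majority vote drives the error down fast; correlated errors (the same systematic blind spot in every judge) get no such benefit. The 2% per-rater error rate below is an assumption for illustration, not a measured figure.

```python
# Probability that a strict majority of independent raters is wrong,
# assuming each rater errs independently with probability p_err (assumed 2%).
from math import comb

def majority_error(n_raters: int, p_err: float) -> float:
    k_needed = n_raters // 2 + 1
    return sum(
        comb(n_raters, k) * p_err**k * (1 - p_err) ** (n_raters - k)
        for k in range(k_needed, n_raters + 1)
    )

for n in (1, 3, 5, 11):
    print(n, majority_error(n, 0.02))
# Error falls from 2% (n=1) to roughly 0.12% (n=3) to roughly 0.008% (n=5):
# independent noise averages out. If the judges share the same blind spot,
# adding more of them does not help.
```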