
Comment by daveguy

3 hours ago

Really tired of you making up stuff about this. The baseline and the entire benchmark evaluation are clearly defined, with a statistically sound number of participants for the baseline, all using the same deterministic environments for evaluation. The fact that you don't like where the "human performance" line was drawn, or how the scale is derived, is not the same as the benchmark being tested with "radically different inputs". Clearly you would rather hype AI than critically advance it. So I won't waste time with someone who is clearly not posting in good faith.

Byebye now.

Humans and LLMs are not seeing the benchmark in the same format. What's made up about that? Could you solve these tasks in the JSON format?
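To make the point concrete, here's a minimal sketch of the format gap. This assumes an ARC-style task encoding (grids as nested lists of color indices in JSON); the task values here are made up for illustration, not taken from the actual benchmark:

```python
import json

# Hypothetical ARC-style task item (illustrative, not a real benchmark task).
task_json = '{"input": [[0, 1, 0], [1, 1, 1], [0, 1, 0]]}'

# What the model is typically given: the raw JSON string.
print(task_json)

# Roughly what a human sees: a rendered grid (ASCII here; the real UI uses colored cells).
grid = json.loads(task_json)["input"]
for row in grid:
    print("".join("#" if cell else "." for cell in row))
# → .#.
# → ###
# → .#.
```

Same information, very different presentation: one is a token stream, the other a 2D image where the shape is immediately visible.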

Look man, don't reply if you don't want to.