Comment by AIPedant

19 hours ago

It's more like using a faulty and dangerous automated foundry to make steel when you could just hire steelworkers.

That's the real problem here - these companies are swimming in money and have armies of humans working around the clock training LLMs, so there is no honest reason to nickel-and-dime the actual evaluation of benchmarks. It's like OpenAI using exact text search to identify benchmark contamination for the GPT-4 technical report. I am quite certain they had more sophisticated tools available.