Comment by AIPedant
19 hours ago
It's more like using a faulty and dangerous automated foundry to make steel when you could just hire steelworkers.
That's the real problem here - these companies are swimming in money and have armies of humans working around the clock training LLMs, so there is no honest reason to nickel-and-dime the actual evaluation of benchmarks. It's like OpenAI using exact text search to identify benchmark contamination for the GPT-4 technical report. I am quite certain they had more sophisticated tools available.