Comment by ojosilva
10 hours ago
Drawbacks for using a pelican on a bicycle svg: it's a very open-ended prompt, no specific criteria to judge, and lately the svg all start to look similar, or at least like they accomplished the same non-goals (there's a pelican, there's a bicycle and I'm not sure its feet should be on the saddle or on the pedals), so it's hard to agree on which is better. And, certainly, having a LLM as a judge, the entire game becomes double-hinged and who knows what to think.
Also, if it becomes popular, training sets may pick it up and improve models unfairly and unrealistically. But that's true of any known benchmark.
Side note: I'd really like to see the Language Benchmark Game become a prompt based languages * models benchmark game. So we could say model X excels at Python Fasta, etc. although then the risk is that, again, it becomes training set and the whole thing self-rigs itself.
No comments yet
Contribute on Hacker News ↗