Comment by smusamashah
6 months ago
We need results from these harder/different benchmarks which give pretty bad scores to current top LLMs.
There are always the new sets from Leaderboard v2: https://huggingface.co/spaces/open-llm-leaderboard/blog
The sample answers for the horse race question are crazy [0]. Pretty much all the LLMs really want to split the 6 horses into two groups of three.
Only LLAMA 3 justifies this by noting that only 2 horses can be raced at a time, but it then gets its own modified question wrong by racing three horses at once. I personally would consider an answer that presumes some restriction on how the horses can be raced to be valid, as long as it answers the restricted version correctly.
[0]: https://arxiv.org/html/2405.19616v2#S9.SS2.SSS1
"The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language."
I think this benchmark would really only tell me whether Wolfram's book was in the training data.
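For context, a made-up exercise in the style of the book (not an actual test case from it) would look something like:

    (* English spec: "Make a list of the squares of the first ten integers." *)
    Table[n^2, {n, 10}]
    (* => {1, 4, 9, 16, 25, 36, 49, 64, 81, 100} *)

So a model that has memorized the book gets tasks of this shape essentially for free.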
It's available online in HTML form, for free:
https://www.wolfram.com/language/elementary-introduction/3rd...
Yeah, maybe we should skip that benchmark.
I am happy to run the tests on the Kagi LLM benchmark. Is there an API endpoint for this model anywhere?