Comment by smusamashah

6 months ago

We need results from these harder/different benchmarks, which give pretty bad scores to the current top LLMs.

https://news.ycombinator.com/item?id=40179232

The sample answers for the horse race question are crazy. [0] Pretty much all the LLMs really want to split the 6 horses into two groups of three, instead of just racing all six at once.

Only Llama 3 justifies the split by claiming that only 2 horses can be raced at a time, but it then gets its own modified question wrong by racing three horses. Personally, I would consider an answer that presumes some restriction on how the horses can be raced to be valid, as long as it answers that restricted version correctly.

[0]: https://arxiv.org/html/2405.19616v2#S9.SS2.SSS1

"The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language."

I think this benchmark would really only tell me whether Wolfram's book was in the training data.

I am happy to run the tests on the Kagi LLM benchmark. Is there an API endpoint for this model anywhere?
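
For what it's worth, here is a minimal sketch of the kind of call a benchmark harness would make, assuming the model exposes an OpenAI-compatible chat completions endpoint. The URL, model id, and API key below are placeholders, not a real deployment:

    # Minimal sketch: send one benchmark prompt to a hypothetical
    # OpenAI-compatible chat completions endpoint.
    import requests

    API_URL = "https://example.com/v1/chat/completions"  # placeholder URL
    MODEL = "model-under-test"                           # placeholder model id

    def ask(prompt: str) -> str:
        resp = requests.post(
            API_URL,
            headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0,  # keep outputs as deterministic as possible
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    print(ask("Example benchmark prompt goes here."))

If someone can point me at an endpoint like that, plugging the model into the test run is straightforward.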