Comment by smusamashah

6 months ago

We need results from these harder/different benchmarks, which give pretty bad scores to the current top LLMs.

https://news.ycombinator.com/item?id=40179232

The sample answers for the horse race question are crazy. [0] Pretty much all the LLMs really want to split the 6 horses into two groups of three, instead of just racing all six at once.

Only Llama 3 justifies the split by claiming that only 2 horses can be raced at a time, but it then gets its own modified question wrong by racing three horses. Personally, I would consider an answer that presumes some restriction on how the horses can be raced to be valid, as long as it answers that restricted version correctly.

[0]: https://arxiv.org/html/2405.19616v2#S9.SS2.SSS1

"The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language."

I think this benchmark would really only tell me whether Wolfram's book was in the training data.

I am happy to run the tests on the Kagi LLM benchmark. Is there an API endpoint for this model anywhere?
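
For what it's worth, here is a minimal sketch of the kind of call a benchmark harness would make, assuming the model exposes an OpenAI-compatible chat completions endpoint. The URL, model id, and API key below are placeholders, not a real deployment:

    # Minimal sketch: send one benchmark prompt to a hypothetical
    # OpenAI-compatible chat completions endpoint.
    import requests

    API_URL = "https://example.com/v1/chat/completions"  # placeholder URL
    MODEL = "model-under-test"                           # placeholder model id

    def ask(prompt: str) -> str:
        resp = requests.post(
            API_URL,
            headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0,  # keep outputs as deterministic as possible
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    print(ask("Example benchmark prompt goes here."))

If someone can point me at an endpoint like that, plugging the model into the test run is straightforward.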