Comment by scotty79
8 hours ago
Single prompt performance is interesting because the best agentic results of yesterday turned out to be the best single prompt results of today.
If we stopped developing LLMs, the only reasonable way to benchmark them would be to compare their performance with all the tricks we can build on top of them. Since they are still developing rapidly, any apples to apples comparison is worthwhile.
Of course this particular benchmark is not really single prompt but rather "agentic without steering".