Comment by freediver
1 year ago
Not seeing a major advance in quality with o1, but seeing a major negative impact on cost and latency.
Kagi LLM benchmarking project:
Kagi is most likely evaluating it mainly on deriving answers for the user from search-result snippets. GPT-4o is already plenty good at this, and o1 would only perform better on particular types of hard requests while being much slower.
If you look at Appendix A in the o1 post [1], this becomes quite clear. There's a huge jump in performance on "puzzle" tasks like competitive maths or programming, but the difference on everything else is much less significant, and even that evaluation is still focused on reasoning tasks.
The human preference chart [1] also clearly shows that it doesn't feel that much better to use, hence the overall reaction.
Everyone is complaining about exaggerated marketing, and it's true, but if you take the time to read what they wrote beyond the shallow ads, they are being somewhat honest about what this is.
[1] https://openai.com/index/learning-to-reason-with-llms/
The test has many reasoning, code, and instruction-following questions, which I expected o1 to excel at. I do not have an interpretation for such poor results on our test; I was just sharing them as a data point so people can make up their own minds. My best guess at this point is that o1 is optimized for a very specific and narrow use case, similar to what you suggest.
Hey buddy, you're talking to the owner of Kagi, and the Kagi benchmark is a traditional one.
My bad, you are right; I should have looked into it more carefully and was too dismissive. Still, I think highlighting those charts from OpenAI is important.
Interesting that Gemini performs extremely poorly in those benchmarks.