Comment by freediver
5 months ago
Kagi LLM benchmark updated with general purpose and thinking mode for Sonnet 3.7.
https://help.kagi.com/kagi/ai/llm-benchmark.html
Appears to be the second most capable general-purpose LLM we've tried (second to Gemini 2.0 Pro, ahead of GPT-4o). It is less impressive in thinking mode, landing at about the same level as o1-mini and o3-mini (with an 8192-token thinking budget).
Overall a very nice update: you get a higher-quality, higher-speed model at the same price.
Hope to enable it in Kagi Assistant within 24h!
Thank you to the Kagi team for such a fast turnaround on making new LLMs accessible via the Assistant! The value of Kagi Assistant has been a no-brainer for me.
[flagged]
I find that giving encouraging messages when you're grateful is a good thing for everyone involved. I want the devs to know that their work is appreciated.
Not everything is a tactical operation to get more subscription purchases - sometimes people like the things they use and want to say thanks and let others know.
Some of us just actually really like Kagi...
I'm surprised that Gemini 2.0 is first now. I remember Google models underperforming on Kagi benchmarks.
Having your own hardware to run LLMs will pay dividends. Despite getting off on the wrong foot, I still believe Google is best positioned to run away with the AI lead, solely because they are not beholden to Nvidia and not stuck with a 3rd party cloud provider. They are the only AI team that is top to bottom in-house.
I've used Gemini for its large context window before. It's a great model. But specifically in this benchmark it has always scored very low, so I wonder what has changed.
We should still wait around to see if Huawei is able to perfect its Ascend series for training and inference on SOTA models.
This is a great take
Gemini 2 is really good, and insanely fast.
It's also insanely cheap.
It is, but in this benchmark Gemini has scored very poorly in the past.
How did you choose the 8192-token thinking budget? I've often seen DeepSeek R1 use far more than that.
It was arbitrary, and even with this budget it is already more verbose (and slower) overall than all the other thinking models - check the tokens and latency columns in the table.
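For context, the budget is a per-request parameter in the Anthropic Messages API. A minimal sketch using the Python SDK (the model ID and prompt here are just illustrative):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # illustrative model ID
        max_tokens=16000,                    # must be larger than the thinking budget
        # Extended thinking: cap the tokens spent on internal reasoning.
        thinking={"type": "enabled", "budget_tokens": 8192},
        messages=[{"role": "user", "content": "Summarize the Kagi LLM benchmark."}],
    )

    # The response interleaves "thinking" and "text" content blocks;
    # print only the final answer text.
    for block in response.content:
        if block.type == "text":
            print(block.text)

The budget is a ceiling, not a target, so the model may stop reasoning well before it is exhausted.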
I see it in Kagi Assistant already and it's not even 24 hours! Nice.
One thing I don't understand is why Claude 3.5 Haiku, a non-thinking model in the non-thinking section, is listed with an 8192-token thinking budget.
Do you think Kagi is the right eval tool? If so, why?
The right eval tool depends on your evaluation task. The Kagi LLM benchmark focuses on using LLMs for information retrieval (which is what Kagi does), which includes measuring reasoning and instruction-following capabilities.
Nice, but where is Grok?
Perhaps they're waiting for the Grok API to be public?
I thought o3-mini was o1-mini. OpenAI's naming gets confusing.