Comment by freediver
5 months ago
Kagi LLM benchmark updated with general purpose and thinking mode for Sonnet 3.7.
https://help.kagi.com/kagi/ai/llm-benchmark.html
Appears to be the second most capable general-purpose LLM we've tried (second to Gemini 2.0 Pro, ahead of GPT-4o). It is less impressive in thinking mode, landing at about the same level as o1-mini and o3-mini (with an 8192-token thinking budget).
Overall a very nice update: you get a higher-quality, higher-speed model at the same price.
Hope to enable it in Kagi Assistant within 24h!
Thank you to the Kagi team for such a fast turnaround on making new LLMs accessible via the Assistant! The value of Kagi Assistant has been a no-brainer for me.
[flagged]
I find that giving encouraging messages when you're grateful is a good thing for everyone involved. I want the devs to know that their work is appreciated.
Not everything is a tactical operation to get more subscription purchases - sometimes people like the things they use and want to say thanks and let others know.
Some of us just actually really like Kagi...
I'm surprised that Gemini 2.0 is first now. I remember Google models underperforming on Kagi benchmarks.
Having your own hardware to run LLMs will pay dividends. Despite getting off on the wrong foot, I still believe Google is best positioned to run away with the AI lead, solely because they are not beholden to Nvidia and not stuck with a 3rd party cloud provider. They are the only AI team that is top to bottom in-house.
I've used Gemini for its large context window before. It's a great model. But specifically in this benchmark it has always scored very low, so I wonder what has changed.
We should still wait around to see if Huawei is able to perfect its Ascend series for training and inference on SOTA models.
This is a great take
Gemini 2 is really good, and insanely fast.
It's also insanely cheap.
It is, but in this benchmark Gemini has scored very poorly in the past.
How did you choose the 8192-token thinking budget? I've often seen DeepSeek R1 use far more than that.
It was arbitrary, and even with this budget it is already more verbose (and slower) overall than all the other thinking models - check the tokens and latency columns in the table.
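For context, the budget is a per-request parameter in the Anthropic Messages API. A minimal sketch using the Python SDK (the model ID and prompt here are just illustrative):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # illustrative model ID
        max_tokens=16000,                    # must be larger than the thinking budget
        # Extended thinking: cap the tokens spent on internal reasoning.
        thinking={"type": "enabled", "budget_tokens": 8192},
        messages=[{"role": "user", "content": "Summarize the Kagi LLM benchmark."}],
    )

    # The response interleaves "thinking" and "text" content blocks;
    # print only the final answer text.
    for block in response.content:
        if block.type == "text":
            print(block.text)

The budget is a ceiling, not a target, so the model may stop reasoning well before it is exhausted.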
I see it in Kagi Assistant already and it's not even 24 hours! Nice.
One thing I don't understand is why Claude 3.5 Haiku, a non-thinking model in the non-thinking section, is listed with an 8192-token thinking budget.
Do you think Kagi is the right eval tool? If so, why?
The right eval tool depends on your evaluation task. The Kagi LLM benchmark focuses on using LLMs for information retrieval (which is what Kagi does), which includes measuring reasoning and instruction-following capabilities.
Nice, but where is Grok?
Perhaps they're waiting for the Grok API to be public?
I thought o3-mini was o1-mini. OpenAI's naming gets confusing.