LLM leaderboard – Comparing models from OpenAI, Google, DeepSeek and others

3 months ago (artificialanalysis.ai)

For those curious about a few of the metrics: besides $/token, tokens/s, latency, and context size, they use the results from:

    MMLU-Pro (Reasoning & Knowledge)  
    GPQA Diamond (Scientific Reasoning)  
    Humanity's Last Exam (Reasoning & Knowledge)  
    LiveCodeBench (Coding)  
    SciCode (Coding)  
    HumanEval (Coding)  
    MATH-500 (Quantitative Reasoning)  
    AIME 2024 (Competition Math)  
    Chatbot Arena  (selectively used)

  • > Humanity's Last Exam (Reasoning & Knowledge)

    An article yesterday was saying that ~30% of the chemistry/biology questions on HLE were wrong, misleading, or highly contested in the scientific literature.

Interesting to learn that o4-mini-high has the best intelligence-per-dollar here: its intelligence score is on par with o3-pro, which is twice as expensive and slower.

Whenever you present a sortable table, you might as well make the first click sort ascending or descending according to what makes the most sense for that column. For example, I'm highly unlikely to be interested in which model has the smallest context window, but it always takes two clicks to find which one has the largest.

Sorting null values first isn't very useful either.
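
To make the idea concrete, here's a rough Python sketch of sensible first-click defaults plus nulls-last sorting (the column names and data are made up, not the leaderboard's actual schema):

    models = [
        {"name": "model-a", "context_window": 200_000, "price_per_mtok": 4.0},
        {"name": "model-b", "context_window": None, "price_per_mtok": 0.8},
        {"name": "model-c", "context_window": 1_000_000, "price_per_mtok": 2.5},
    ]

    # First click should sort the way users usually care about:
    # big context windows first, cheap prices first.
    DEFAULT_DESCENDING = {"context_window": True, "price_per_mtok": False}

    def sort_models(rows, column, descending=None):
        if descending is None:
            descending = DEFAULT_DESCENDING.get(column, False)
        present = [r for r in rows if r[column] is not None]
        missing = [r for r in rows if r[column] is None]
        present.sort(key=lambda r: r[column], reverse=descending)
        # Rows with no value for this column always go to the bottom.
        return present + missing

    # First click on "context_window": model-c, model-a, then model-b last.
    print([m["name"] for m in sort_models(models, "context_window")])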

Look at the bar graph comparing the price of every model to Claude Opus.

It's a shame it's so good for coding

https://artificialanalysis.ai/models/claude-4-opus-thinking/...

  • I’ve had very mixed results with 4 Opus. It’s still just a language model and can’t understand some basic concepts.

  • Do you think it is demonstrably better than Sonnet? I grabbed a Pro sub last month shortly after the CLI tool dropped, but I haven't used it in the past couple of weeks because I found myself spending way more time correcting it than getting useful output.

You can consider the o3/o4-mini price to be half that due to flex processing. Flex gives the benefits of the batch API without the downside of waiting for a response. It's not marketed that way but that is my experience. With 20% cache hits I'm averaging around $0.8/million input tokens and $4/million output tokens.
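
As a sanity check on that figure, a back-of-the-envelope blend of cached and uncached flex rates comes out about the same (the per-token prices here are my assumptions, so check OpenAI's current pricing):

    # Assumed per-million-token rates; check OpenAI's current flex/cached pricing.
    FLEX_INPUT = 1.00          # uncached input, flex tier (assumption)
    FLEX_CACHED_INPUT = 0.25   # cached input, flex tier (assumption)
    CACHE_HIT_RATE = 0.20

    blended_input = CACHE_HIT_RATE * FLEX_CACHED_INPUT + (1 - CACHE_HIT_RATE) * FLEX_INPUT
    print(f"~${blended_input:.2f}/M input tokens")  # ~$0.85/M, close to the $0.8/M figure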

  • Do you use them for code generation? I'm simply using Copilot, since $10/mo is a reasonable budget... but a quick guess based on my usage would put code generation via an API at potentially $10/day?

    • o3 is a unique model. For difficult math problems, it generates long reasoning traces (e.g. 10-20k tokens). For coding questions, the reasoning tokens are consistently small, unlike Gemini 2.5 Pro, which generates longer reasoning traces for coding.

      Cost for o3 code generation is therefore driven primarily by context size. If your programming questions have short contexts, then o3 API with flex is really cost effective.

      For 30k input tokens and 3k output tokens, the cost is 30000 * 0.8 / 1000000 + 3000 * 4 / 1000000 = $0.036

      But if your contexts run 100k-200k tokens, then the monthly plans that give you a budget of prompts instead of tokens are probably going to be cheaper.
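
      As a quick sketch of that arithmetic in Python (using the flex rates quoted above; substitute whatever you're actually paying):

          # Rates from the flex discussion above (USD per million tokens).
          INPUT_RATE = 0.8
          OUTPUT_RATE = 4.0

          def request_cost(input_tokens, output_tokens):
              return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000

          print(request_cost(30_000, 3_000))   # 0.036 -> the $0.036 example above
          print(request_cost(150_000, 3_000))  # 0.132 -> large contexts dominate the bill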

Is there an option to filter the list based on the measurements, e.g. "context window > X, intelligence > Y, price < Z"? I'd love that.

It seems the only filter options available are unrelated to the measured metrics.

(I might have missed this since the UI is a bit cluttered.)
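
The site doesn't seem to expose anything like this, but it's a trivial filter once you have the table as data; a hypothetical sketch with made-up rows and column names:

    # Hypothetical rows; the real leaderboard uses different names and values.
    models = [
        {"name": "model-a", "context": 1_000_000, "intelligence": 53, "price": 3.5},
        {"name": "model-b", "context": 200_000, "intelligence": 70, "price": 1.9},
        {"name": "model-c", "context": 128_000, "intelligence": 45, "price": 0.4},
    ]

    def matches(m, min_context, min_intelligence, max_price):
        return (m["context"] >= min_context
                and m["intelligence"] >= min_intelligence
                and m["price"] <= max_price)

    shortlist = [m["name"] for m in models
                 if matches(m, min_context=200_000, min_intelligence=60, max_price=3)]
    print(shortlist)  # ['model-b']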

It is interesting that it ranks `GPT-4.1 mini` higher than `GPT-4.1` (the latter costing five times more).

How about adding a freedom measurement in those columns?

  • It's impossible to be objective about what that means. I can see having a "baggage" field that lists non-performance-related concerns for each model.

Is there an index for judging how much a model distorts the truth in order to comply with a political agenda?