Surprised to find out Grok 3 mini is so economical and ranks higher than the equivalent GPT models. I run most of my agents on GPT-4.1 mini; I might switch now.
For those curious about a few of the metrics: besides $/token, tokens/s, latency, and context size, they use the results from:
> Humanity's Last Exam (Reasoning & Knowledge)
An article yesterday was saying that ~30% of the chemistry/biology questions on HLE were either wrong, misleading, or highly contested in the scientific literature.
Interesting to learn that o4-mini-high has the highest intelligence/$ score here, with intelligence on par with o3-pro, which is twice as expensive and slower.
Whenever you present a sortable table, you might as well make the first click sort ascending or descending according to what makes the most sense for that column. For example, I'm highly unlikely to be interested in which model has the smallest context window, but it always takes two clicks to find which one has the largest.
Sorting null values first isn't very useful either.
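A minimal sketch of what sensible first-click defaults could look like; the rows and column choices below are illustrative, not the site's actual data:

    # Per-column default sort directions; data is made up for illustration.
    models = [
        {"name": "o3", "context": 200_000, "price_in": 2.00},
        {"name": "GPT-4.1 mini", "context": 1_000_000, "price_in": 0.40},
        {"name": "Grok 3 mini", "context": None, "price_in": 0.30},
    ]

    # First click should sort the way users usually want: big context
    # first, cheap price first.
    DEFAULT_DESCENDING = {"context": True, "price_in": False}

    def first_click_sort(rows, column):
        present = [r for r in rows if r.get(column) is not None]
        missing = [r for r in rows if r.get(column) is None]
        present.sort(key=lambda r: r[column],
                     reverse=DEFAULT_DESCENDING.get(column, False))
        return present + missing  # nulls always sort last, not first

    for row in first_click_sort(models, "context"):
        print(row["name"], row["context"])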
Vibe coded websites be like
Not necessarily vibe coded. Sometimes developers don't actually care about the product, and just want to get it over with.
Look at that bar graph comparing the price of every model compared to Claude Opus
It's a shame it's so good for coding
https://artificialanalysis.ai/models/claude-4-opus-thinking/...
I’ve had very mixed results with 4 Opus. It’s still just a language model and can’t understand some basic concepts.
Do you think it is demonstrably better than Sonnet? I grabbed a Pro sub last month shortly after the CLI tool dropped, but I haven't used it in the past couple of weeks because I found myself spending way more time correcting it than getting useful output.
Here's my plot based on Aider benchmarks
https://www.linkedin.com/posts/panela_important-plot-for-fol...
You can consider the o3/o4-mini price to be half that due to flex processing. Flex gives the benefits of the batch API without the downside of waiting for a response. It's not marketed that way, but that is my experience. With 20% cache hits I'm averaging around $0.80/million input tokens and $4/million output tokens.
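For reference, a minimal sketch of opting into flex processing with the OpenAI Python SDK; the prompt is illustrative, and the pricing/latency behavior is my experience as described above, not a guarantee:

    # Request flex processing for o3 via service_tier="flex". Flex trades
    # scheduling priority for roughly half-price tokens; under load these
    # requests can queue or time out, so allow a generous client timeout.
    from openai import OpenAI

    client = OpenAI(timeout=900.0)  # flex calls can take a while

    resp = client.chat.completions.create(
        model="o3",
        service_tier="flex",
        messages=[{"role": "user", "content": "Explain this stack trace: ..."}],
    )
    print(resp.choices[0].message.content)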
I’m shocked people are signing up to pay even these fees to build what are presumably CRUD apps. I sense a complete divergence in the profession between people who use these tools and those who don’t.
A whole codebase of 100k lines (~1M tokens) for about a dollar. I'd like to understand why signing up for this would be shocking.
Some people are struggling to build CRUD apps.
Do you use them for code generation? I'm simply using Copilot, since $10/mo is a reasonable budget... but a quick guess based on my use would put code generation via an API at potentially $10/day?
o3 is a unique model. For difficult math problems it generates long reasoning traces (e.g. 10-20k tokens), but for coding questions the reasoning tokens are consistently small. This is unlike Gemini 2.5 Pro, which generates longer reasoning traces for coding questions.
Cost for o3 code generation is therefore driven primarily by context size. If your programming questions have short contexts, then o3 API with flex is really cost effective.
For 30k input tokens and 3k output tokens, the cost is 30000 * 0.8/1000000 + 3000 * 4/1000000 = $0.024 + $0.012 = $0.036
But if you have contexts between 100k-200k, then the monthly plans that give you a budget of prompts instead of tokens are probably going to be cheaper.
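A quick back-of-the-envelope calculator at the flex rates quoted above, which makes the context-size effect obvious:

    # Per-request cost at the flex rates above ($0.80/M in, $4/M out);
    # adjust the rates to whatever your plan actually charges.
    INPUT_PER_M = 0.80
    OUTPUT_PER_M = 4.00

    def request_cost(input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * INPUT_PER_M
                + output_tokens * OUTPUT_PER_M) / 1_000_000

    print(f"${request_cost(30_000, 3_000):.3f}")   # short context: $0.036
    print(f"${request_cost(150_000, 3_000):.3f}")  # long context: $0.132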
Is there an option to filter the list based on the measurements, e.g. "context window > X, intelligence > Y, price < Z"? I'd love that.
It seems the only filter options available are unrelated to the measured metrics.
(I might have missed this, since the UI is a bit cluttered.)
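In the meantime, a throwaway client-side filter is easy to sketch; the rows and field names below are hypothetical, not the site's actual data or API:

    # Hypothetical: filter an exported copy of the leaderboard rows by the
    # measured metrics. Values are made up for illustration.
    rows = [
        {"model": "o4-mini-high", "context": 200_000, "intelligence": 70, "price": 1.93},
        {"model": "GPT-4.1 mini", "context": 1_000_000, "intelligence": 53, "price": 0.70},
    ]

    def matches(r, min_context, min_intel, max_price):
        return (r["context"] > min_context
                and r["intelligence"] > min_intel
                and r["price"] < max_price)

    print([r["model"] for r in rows if matches(r, 100_000, 50, 2.00)])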
Related:
Benchmarks and comparison of LLM AI models and API hosting providers - https://news.ycombinator.com/item?id=39014985 - Jan 2024 (70 comments)
It is interesting that it ranks `GPT-4.1 mini` higher than `GPT-4.1` (the latter costing five times more).
How about adding a freedom measurement in those columns?
Impossible to be objective on what that means. I can see having a "baggage" field that lists non-performance-related concerns for each.
Is there an index for judging how much a model distorts the truth in order to comply with a political agenda?
It's not perfect, but, yes: https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
How would you create the base "truth" for these models? People are adamant about both sides of many topics.
"Which country started the Korean war?", "Did Israel genocide the people of Gaza?", "Does China have lawful rights over Taiwan?"
For a start, you don't ask such subjective questions; that's a bit silly. Instead you ask for, e.g., the death toll of Israel vs. Palestine in the last year, or the number of deaths surrounding the Tiananmen Square protests. If it gives you straight answers with numbers (or at least a consistent estimate) and cites its sources, that's a good start.
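A rough sketch of that kind of probe; the questions, the stub answer, and the scoring heuristics here are illustrative stand-ins for a real harness:

    import re

    # Ask factual questions with checkable numbers, then flag answers that
    # dodge (no number at all) or don't point to sources. Swap ask_model
    # for a real API call to the model under test.
    QUESTIONS = [
        "What is the estimated death toll of the Tiananmen Square protests?",
        "What were the death tolls in Israel and in Gaza over the last year?",
    ]

    def ask_model(question: str) -> str:
        # Stub standing in for a real model call.
        return "Estimates range from 200 to several thousand, according to ..."

    def score(answer: str) -> dict:
        return {
            "gives_number": bool(re.search(r"\d", answer)),
            "cites_source": "according to" in answer.lower() or "http" in answer,
        }

    for q in QUESTIONS:
        print(q, score(ask_model(q)))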
Hopefully obviously, by testing it against objective facts which are nonetheless "controversial" politically.