Comment by _peregrine_

2 months ago

Already test Opus 4 and Sonnet 4 in our SQL Generation Benchmark (https://llm-benchmark.tinybird.live/)

Opus 4 beat all other models. It's good.

29 comments

_peregrine_

It's weird that Opus4 is the worst at one-shot, it requires on average two attempts to generate a valid query.

If a model is really that much smarter, shouldn't it lead to better first-attempt performance? It still "thinks" beforehand, right?

riwsky 2 months ago

Don’t talk to Opus before it’s had its coffee. Classic high-performer failure mode.

stadeschuldt 2 months ago

Interestingly, both Claude-3.7-Sonnet and Claude-3.5-Sonnet rank better than Claude-Sonnet-4.

_peregrine_ 2 months ago

yeah that surprised me too

Workaccount2 2 months ago

This is a pretty interesting benchmark because it seems to break the common ordering we see with all the other benchmarks.

_peregrine_ 2 months ago

Yeah I mean SQL is pretty nuanced - one of the things we want to improve in the benchmark is how we measure "success", in the sense that multiple correct SQL results can look structurally dissimilar while semantically answering the prompt.
There's some interesting takeaways we learned here after the first round: https://www.tinybird.co/blog-posts/we-graded-19-llms-on-sql-...

ineedaj0b 2 months ago

i pay for claude premium but actually use grok quite a bit, the 'think' function usually gets me where i want more often than not. odd you don't have any xAI models listed. sure grok is a terrible name but it surprises me more often. i have not tried the $250 chatgpt model yet though, just don't like openAI practices lately.

timmytokyo 2 months ago
Not saying you're wrong about "OpenAI practices", but that's kind of a strange thing to complain about right after praising an LLM that was only recently inserting claims of "white genocide" into every other response.
- veidr 2 months ago
  
  For real, though.
  Even if you don't care about racial politics, or even good-vs-evil or legal-vs-criminal, the fact that that entire LLM got (obviously, and ineptly) tuned to the whim of one rich individual — even if he wasn't as creepy as he is — should be a deal-breaker, shouldn't it?

gkfasdfasdf 2 months ago

Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).

zarathustreal 2 months ago

“Your model has memorized all knowledge, how do you know it’s smart?”

sagarpatil 2 months ago

Sonnet 3.7 > Sonnet 4? Interesting.

dcreater 2 months ago

How does Qwen3 do on this benchmark?

mritchie712 2 months ago

looks like this is one-shot generation right?

I wonder how much the results would change with a more agentic flow (e.g. allow it to see an error or select * from the_table first).

sonnet seems particularly good at in-session learning (e.g. correcting it's own mistakes based on a linter).

_peregrine_ 2 months ago

Actually no, we have it up to 3 attempts. In fact, Opus 4 failed on 36/50 tests on the first attempt, but it was REALLY good at nailing the second attempt after receiving error feedback.

jpau 2 months ago

Interesting!

Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?

_peregrine_ 2 months ago

No it's definitely interesting. It suggests that Opus 4 actually failed to write proper syntax on the first attempt, but given feedback it absolutely nailed the 2nd attempt. My takeaway is that this is great for peer-coding workflows - less "FIX IT CLAUDE"

XCSme 2 months ago

That's a really useful benchmark, could you add 4.1-mini?

_peregrine_ 2 months ago

Yeah we're always looking for new models to add

jjwiseman 2 months ago

Please add GPT o3.

_peregrine_ 2 months ago

Noted, also feel free to add an issue to the GitHub repo: https://github.com/tinybirdco/llm-benchmark

varunneal 2 months ago

Why is o3-mini there but not o3?

_peregrine_ 2 months ago

We should definitely add o3 - probably will soon. Also looking at testing the Qwen models

joelthelion 2 months ago

Did you try Sonnet 4?

vladimirralev 2 months ago
It's placed at 10. Below claude-3.5-sonnet, GPT 4.1 and o3-mini.
- _peregrine_ 2 months ago
  
  yeah this was a surprising result. of course, bear in mind that testing an LLM on SQL generation is pretty nuanced, so take everything with a grain of salt :)
- TacticalCoder 2 months ago
  
  [dead]

kadushka 2 months ago

what about o3?

_peregrine_ 2 months ago

We need to add it