Comment by hodgehog11
2 days ago
For reference, here is the terminal-bench leaderboard:
https://www.tbench.ai/leaderboard
Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.
Garbage benchmark: an inconsistent mix of "agent tools" and models. If you wanted to present a meaningful benchmark, the agent tool would stay the same so we could really compare the models.
That said, there are plenty of other benchmarks that disagree with these. From my experience most of these benchmarks are trash. Use the model yourself, apply your own set of problems, and see how well it fares.
Hey. I like your roast of benchmarks.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by a human with rubrics). Would love for you to check them out and share your thoughts:
Example recent one on GPT-5:
https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...
All results:
https://eval.16x.engineer/evals/coding
Which benchmarks are not garbage?
I don't consider myself super special. I think it should be doable to create a benchmark that beats me having to test every single new model myself.
tbh, companies like Anthropic and OpenAI create custom agents for specific benchmarks
Do you have a source for this? I’m intrigued
https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5a... "we iteratively refine prompting by analyzing failure cases and developing prompts to address them."
Aren't good benchmarks supposed to be secret?
This industry is currently burning billions a month. With that much money around I don't think any secrets can exist.
How can a benchmark be secret if you post it to an API to test a model on it?
"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"
:P
If you want to do it properly you have to avoid any third-party-hosted model when you test your benchmark, which means you can't have GPT-5, Claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.
So no.
They're not secret.
Depends on the agent. Ranks 5 and 15 are both Claude 4 Sonnet, and this stands close to 15th.
My personal experience is that it produces high quality results.
Any example or prompt you used to make this statement?
I remember asking for quotes about the Spanish conquest of South America because I couldn't remember who said a specific thing. The GPT model started hallucinating quotes on the topic, while DeepSeek responded with something like, "I don't know a quote about that specific topic, but you might mean this other thing," and then cited a real quote on the same topic, after acknowledging that it couldn't find the one I had read in an old book. I don't use it for coding, but for things that are more unique I feel it is more precise.
I'm doing coreference resolution and this model (w/o thinking) performs at the Gemini 2.5-Pro level (w/ thinking_budget set to -1) at a fraction of the cost.
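For anyone who wants to try the same comparison, here is a minimal sketch of how the two calls could look. The prompt, model IDs, and harness are my own assumptions (not the commenter's actual setup); it just puts DeepSeek's OpenAI-compatible endpoint next to the Gemini config with thinking_budget set to -1 mentioned above.

    # Rough sketch: send the same coreference prompt to DeepSeek (non-thinking)
    # and to Gemini 2.5 Pro with dynamic thinking (thinking_budget=-1).
    # Model names and the prompt are illustrative assumptions.
    import os
    from openai import OpenAI          # DeepSeek exposes an OpenAI-compatible API
    from google import genai
    from google.genai import types

    PROMPT = (
        "Resolve the coreferences in the following text and list each "
        "pronoun with the entity it refers to:\n"
        "'Alice told Maria that she had left her keys at the office.'"
    )

    # DeepSeek, non-thinking chat model
    deepseek = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                      base_url="https://api.deepseek.com")
    ds_out = deepseek.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": PROMPT}],
    )
    print("DeepSeek:", ds_out.choices[0].message.content)

    # Gemini 2.5 Pro with dynamic thinking (-1 lets the model pick its budget)
    gemini = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    g_out = gemini.models.generate_content(
        model="gemini-2.5-pro",
        contents=PROMPT,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=-1)
        ),
    )
    print("Gemini:", g_out.text)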
Vine is about the only benchmark I think is real.
We made objective systems turn out subjective answers… why the shit would anyone think objective tests would be able to grade them?
The DeepSeek R1 in that list is the old model that's been replaced. Update: Understood.
Yes, and 31.3% is given in the announcement as the performance of the new v3.1, which would put it in sixteenth place.
Yeah, but the pricing is insane. I don't care about SOTA if it's going to break my bank.