Comment by YetAnotherNick

7 days ago

TerminalBench is about the worst-named benchmark. It has almost nothing to do with the terminal itself; it mostly tests the syntax of random tools. Also, most tasks aren't agentic if the model has simply memorized some random tool's command-line flags.

4 comments

esafak 6 days ago

What do you mean? It tests whether the model knows the tools and uses them.

YetAnotherNick 6 days ago
Yeah, it's a knowledge benchmark, not an agentic benchmark.
- esafak 6 days ago
  
  That's like saying coding benchmarks are about memorizing the language syntax. You have to know what to call when and how. If you get the job done you win.
  