Comment by esafak

6 days ago

What do you mean? It tests whether the model knows the tools and uses them.

Yeah, it's a knowledge benchmark, not an agentic benchmark.

  • That's like saying coding benchmarks are about memorizing the language syntax. You have to know what to call when and how. If you get the job done you win.

    • I am saying the opposite. If a coding benchmark just tests the syntax of an esoteric language, it shouldn't be called a coding benchmark.

      For a benchmark named terminal bench, I would assume it would require some terminal "interaction", not just giving the code and command.