
Comment by paradite

7 days ago

Performance depends not only on the tool, but also on the model, the codebase you are working on (context), and the task given (prompt).

And all these factors are not independent. Some combinations work better than others. For example:

- Claude Sonnet 4 might work well for feature implementation on backend Python code using Claude Code.

- Gemini 2.5 Pro might work better for bug fixes on frontend React codebases.

...

So you can't just test the tools alone while keeping everything else constant. Instead you get a combinatorial explosion of tool * model * context * prompt to test. A rough sketch of that blow-up is below.
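To make the explosion concrete, here is a minimal sketch that counts how many evaluation runs even a small grid requires. The specific tool, model, context, and task names are just hypothetical placeholders, not an endorsed list:

```python
# Every factor multiplies the number of runs you would need.
from itertools import product

tools = ["Claude Code", "Cursor", "Aider"]                  # coding tools / agents (hypothetical set)
models = ["Claude Sonnet 4", "Gemini 2.5 Pro", "GPT-4.1"]   # underlying models
contexts = ["backend Python", "frontend React"]             # codebase / context
tasks = ["feature implementation", "bug fix"]               # prompt / task type

combinations = list(product(tools, models, contexts, tasks))
print(len(combinations))  # 3 * 3 * 2 * 2 = 36 runs, before repeating any for variance
```

And that is before you repeat each combination several times to account for model nondeterminism.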

16x Eval can tackle parts of the problem, but it doesn't cover factors like tools yet.

https://eval.16x.engineer/