Comment by cristea

7 days ago

I would love a comparison between all these new tools, like this with Claude Code, opencode, aider and cortex.

I just can’t get an easy overview of how each tool works and how they differ

One of the difficulties -- and one that is currently a big problem in LLM research -- is that comparisons with or evaluations of commercial models are very expensive. I co-wrote a paper recently and we spent more than $10,000 on various SOTA commercial models in order to evaluate our research. We could easily (and cheaply) show that we were much better than open-weight models, but we knew that reviewers would ding us if we didn't compare to "the best."

Even aside from the expense (which penalizes universities and smaller labs), I feel it's a bad idea to require academic research to compare itself to opaque commercial offerings. We have very little detail on what's really happening when, for example, OpenAI does inference. And their technology stack and model can change at any time, and users won't know unless they carefully re-benchmark ($$$) every time they use the model. I feel that academic journals should discourage comparisons to commercial models, unless we have very precise information about the architecture, engineering stack, and training data they use.
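For a rough sense of how that kind of bill adds up, here is a back-of-envelope sketch in Python. Every number in it (per-token prices, benchmark size, token counts) is a hypothetical placeholder, not any provider's actual pricing:

```python
# Back-of-envelope cost estimate for benchmarking one commercial model via API.
# All numbers below are hypothetical placeholders, not real prices or benchmark sizes.

price_per_1m_input = 3.00    # USD per 1M input tokens (assumed)
price_per_1m_output = 15.00  # USD per 1M output tokens (assumed)

examples = 5_000             # benchmark examples
runs = 3                     # repeated runs to reduce variance
input_tokens = 8_000         # long prompt/context per example
output_tokens = 1_000        # generated tokens per example

total_input = examples * runs * input_tokens
total_output = examples * runs * output_tokens

cost = (total_input / 1e6) * price_per_1m_input + (total_output / 1e6) * price_per_1m_output
print(f"~${cost:,.0f} for a single model")  # ~$585 here; several models and ablations multiply it
```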

  • you have to separate the model from the interface, imho.

    you can totally evaluate these as GUIs, CLIs, and TUIs with more or fewer features and connectors.

    Model quality is about benchmarks.

    aider is great at showing benchmarks for their users

    gemini-cli now tells you the % of correct tool calls when a session ends (see the sketch below)
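For illustration, here is a minimal sketch of how such a per-session tool-call metric could be computed. The log structure and field names are made up for this example; it is not gemini-cli's actual implementation.

```python
# Minimal sketch of a per-session tool-call success metric.
# The log structure and field names are hypothetical, not gemini-cli's internals.

session_log = [
    {"tool": "read_file", "ok": True},
    {"tool": "edit_file", "ok": True},
    {"tool": "run_tests", "ok": False},  # e.g. the tool call returned an error
    {"tool": "run_tests", "ok": True},
]

total = len(session_log)
succeeded = sum(1 for call in session_log if call["ok"])
print(f"Tool calls: {succeeded}/{total} succeeded ({succeeded / total:.0%})")
```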

This used to be opencode but was renamed after some fallout between the devs I think.

  • If anyone is curious about the context:

    https://x.com/thdxr/status/1933561254481666466
    https://x.com/meowgorithm/status/1933593074820891062
    https://www.youtube.com/watch?v=qCJBbVJ_wP0

    Gemini summary of the above:

    - Kujtim Hoxha creates a project named TermAI using open-source libraries from the company Charm.

    - Two other developers, Dax (a well-known internet personality and developer) and Adam (a developer and co-founder of Chef, known for his work on open-source and developer tools), join the project.

    - They rebrand it to OpenCode, with Dax buying the domain and both heavily promoting it and improving the UI/UX.

    - The project rapidly gains popularity and GitHub stars, largely due to Dax and Adam's influence and contributions.

    - Charm, the company behind the original libraries, offers Kujtim a full-time role to continue working on the project, effectively acqui-hiring him.

    - Kujtim accepts the offer. As the original owner of the GitHub repository, he moves the project and its stars to Charm's organization. Dax and Adam object, not wanting the community project to be owned by a VC-backed company.

    - Allegations surface that Charm rewrote git history to remove Dax's commits, banned Adam from the repo, and deleted comments that were critical of the move.

    - Dax and Adam, who own the opencode.ai domain and claim ownership of the brand they created, fork the original repo and launch their own version under the OpenCode name.

    - For a time, two competing projects named OpenCode exist, causing significant community confusion.

    - Following the public backlash, Charm eventually renames its version to Crush, ceding the OpenCode name to the project now maintained by Dax and Adam.

The performance depends not only on the tool, but also on the model, the codebase you are working on (context), and the task given (prompt).

And all these factors are not independent. Some combinations work better than others. For example:

- Claude Sonnet 4 might work well for feature implementation on backend Python code using Claude Code.

- Gemini 2.5 Pro might work better for bug fixes on frontend React codebases.

...

So you can't just test the tools alone, holding everything else constant, and expect the results to generalize. Instead you get a combinatorial explosion of tool × model × context × prompt combinations to test.
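To make the scale concrete, here is a tiny sketch of the combinatorics. The specific tools, models, codebases, and task types are illustrative placeholders:

```python
from itertools import product

# Hypothetical evaluation axes; the specific entries are illustrative only.
tools = ["Claude Code", "opencode", "aider", "gemini-cli"]
models = ["Claude Sonnet 4", "Gemini 2.5 Pro", "GPT-4.1"]
contexts = ["Python backend", "React frontend", "data pipeline"]
tasks = ["feature implementation", "bug fix", "refactor"]

configurations = list(product(tools, models, contexts, tasks))
print(len(configurations))  # 4 * 3 * 3 * 3 = 108 configurations, before repeat runs
```

Even with a handful of options per axis you are already at over a hundred configurations, each of which needs multiple runs to control for nondeterminism.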

16x Eval can tackle parts of the problem, but it doesn't cover factors like tools yet.

https://eval.16x.engineer/