Comment by riotnrrd
7 days ago
One of the difficulties -- and one that is currently a big problem in LLM research -- is that comparisons with or evaluations of commercial models are very expensive. I co-wrote a paper recently and we spent more than $10,000 on various SOTA commercial models in order to evaluate our research. We could easily (and cheaply) show that we were much better than open-weight models, but we knew that reviewers would ding us if we didn't compare to "the best."
Even aside from the expense (which penalizes universities and smaller labs), I feel it's a bad idea to require academic research to compare itself to opaque commercial offerings. We have very little detail on what's really happening when, for example, OpenAI runs inference. Their technology stack and model can change at any time, and users won't know unless they carefully re-benchmark ($$$) every time they use the model. I feel that academic journals should discourage comparisons to commercial models unless we have very precise information about the architecture, engineering stack, and training data they use.
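
At minimum, if you do benchmark a commercial model, it helps to log whatever stack metadata the API hands back so a silent change is at least detectable later. A rough sketch of what I mean, assuming the openai Python SDK (the field names are the ones that SDK exposes; adapt for other providers):

    # Sketch: record the model/stack metadata the API reports alongside each
    # benchmark response, so a silent backend change is at least detectable.
    # Assumes the `openai` Python SDK; adapt for other providers.
    import datetime
    import json

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def benchmark_call(prompt: str, model: str = "gpt-4o") -> dict:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        record = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "requested_model": model,
            "served_model": resp.model,  # the exact model string the API says it served
            "system_fingerprint": resp.system_fingerprint,  # changes when the backend config changes
            "answer": resp.choices[0].message.content,
        }
        with open("benchmark_log.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")
        return record

It doesn't solve the reproducibility problem, but at least you can point to the fingerprint when the numbers shift under you.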
You have to separate the model from the interface, imho.
You can totally evaluate these as GUIs, CLIs, and TUIs with more or fewer features and connectors.
Model quality is about benchmarks.
aider is great at showing benchmarks to its users
gemini-cli now tells you the % of correct tool calls at the end of a session
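
To be clear, that's not a claim about how gemini-cli computes it; the metric itself is trivial once you have a session log of tool calls. A hypothetical sketch, with an assumed log format:

    # Hypothetical sketch of the kind of per-session metric gemini-cli reports:
    # the share of tool calls in a session that completed without an error.
    # The event/log format here is assumed, not gemini-cli's actual one.
    def tool_call_success_rate(session_events: list[dict]) -> float:
        calls = [e for e in session_events if e.get("type") == "tool_call"]
        if not calls:
            return 1.0  # no tool calls made, so nothing failed
        return sum(1 for e in calls if e.get("ok")) / len(calls)

    # Example: 3 of 4 tool calls succeeded -> 75%
    events = [
        {"type": "tool_call", "ok": True},
        {"type": "tool_call", "ok": True},
        {"type": "tool_call", "ok": False},
        {"type": "tool_call", "ok": True},
        {"type": "message"},
    ]
    print(f"{tool_call_success_rate(events):.0%}")  # prints 75%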