Comment by thimabi
2 months ago
It’s been hard to keep up with the evolution in LLMs. SOTA models basically change every other week, and each of them has its own quirks.
Differences in features, personality, output formatting, UI, safety filters… make it nearly impossible to migrate workflows between distinct LLMs. Even models of the same family exhibit strikingly different behaviors in response to the same prompt.
Still, having to find each model’s strengths and weaknesses on my own is certainly much better than not seeing any progress in the field. I just hope that, eventually, LLM providers converge on a similar set of features and behaviors for their models.
My advice: don't jump around between LLMs for a given project. The AI space is progressing too rapidly right now. Save yourself the sanity.
A man with one compass knows where he's going; a man with two compasses is never sure.
Isn't that an argument to jump around, since performance improves so rapidly between models?
I think the idea is you might end up spending your time shaving a yak. Finish your project, then try the new SOTA on your next task.
But it's also churn: I think the point is more that you'll be more productive with a setup whose quirks you've learnt than with the newest one, whose quirks you haven't.
Each model has its own strengths and weaknesses tho. You really shouldn't be using one model for everything. Like, Claude is great at coding but is expensive, so you wouldn't use it for everything from debugging to writing test benches. The OpenAI models are weaker at architecture but cheap, so they're ideal for test benches, for example.
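The "different model per task" idea boils down to a routing table. Here's a minimal sketch in Python; the task names, model names, and the `ROUTES` mapping are all illustrative assumptions, not recommendations from this thread:

```python
# Hypothetical sketch: route each task type to the model that is the
# best cost/quality trade-off for it, instead of one model for everything.
ROUTES = {
    "architecture": "claude-sonnet",  # assumed: stronger at design, pricier
    "debugging": "claude-sonnet",
    "test_benches": "gpt-4o-mini",    # assumed: cheaper, fine for boilerplate
}

def pick_model(task: str, default: str = "gpt-4o-mini") -> str:
    """Return the model configured for a task, falling back to a cheap default."""
    return ROUTES.get(task, default)

print(pick_model("architecture"))   # claude-sonnet
print(pick_model("unknown-task"))   # gpt-4o-mini
```

The nice side effect of keeping this in one place is that when a new SOTA model lands, you update one table entry instead of touching every workflow.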
You did not read what I said:
> don't jump around between LLMs for a given project
I didn't say anything about sticking to a single model for every project.
You should at least have two to sanity check difficult programming solutions.
How important is it to be using SOTA? Or even jump on it already?
Feels a bit like when it was a new frontend framework every week. Didn't jump on any then. Sure, when React was the winner, I had a few months less experience than those who bet on the correct horse. But nothing I couldn't quickly catch up to.
> How important is it to be using SOTA?
I believe in using the best model for each use case. Since I’m paying for it, I like to find out which model is the best bang for my buck.
The problem is that, even when comparing models according to different use cases, better models eventually appear, and the models one uses eventually change as well — for better or worse. This means that using the same model over and over doesn’t seem like a good decision.
Vibe code an eval harness with a web dashboard
Have you tried a package like LiteLLM so that you can more easily validate and switch to a newer model?
The key seems to be in curating your application's evaluation set.
I'd love something like litellm, but simpler. I'm not provisioning models for my organization, I don't need to granularly track spend, I just want one endpoint to point every tool or client at for ease of configuration and curiosity around usage.