Comment by petesergeant
5 days ago
I wonder if we'll start to see artisanal benchmarks. You and I both have preferred models for certain tasks. There's a world in which we start to see how things score on the "simonw chattiness index" and come to rely on smaller, more specific benchmarks.
Yeah, I think personalized evals will definitely be a thing. After reviewing way too much Arena and WildChat, and having now seen lots of live traces firsthand, I can say there's a wide range of LLM usage (and preferences) that really doesn't match my own tastes or requirements, lol.
For the past year or two, I've had my own personal 25-question vibe check that I use to kick the tires on new models, but I think the future is something both a little more rigorous and a little more automated (something like an LLM jury with UltraFeedback-style criteria based on your own real-world exchanges, then BTL-ranked?). A future project...
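To make the "BTL ranked" part concrete, here's a rough sketch of how pairwise jury verdicts could be turned into a ranking with a Bradley-Terry(-Luce) fit. This is only an illustration, not an existing tool: the function name `bradley_terry`, the model names, and the hard-coded verdicts are all hypothetical, and in practice the (winner, loser) pairs would come from an LLM jury scoring your own exchanges.

```python
from collections import defaultdict


def bradley_terry(comparisons, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard minorization-maximization update
        p_i <- wins_i / sum_j n_ij / (p_i + p_j)
    and returns a dict of strengths normalized to sum to 1.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        if winner == loser:
            continue  # a model compared against itself carries no information
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strengths = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                n_ij = pair_counts.get(frozenset((i, j)), 0.0) if j != i else 0.0
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom if denom else strengths[i]
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strengths[m]) for m in models)
        strengths = updated
        if delta < tol:
            break
    return strengths


# Hypothetical jury verdicts: each tuple is (preferred model, other model).
verdicts = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")] * 1
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")] * 1
    + [("model_a", "model_c")] * 3 + [("model_c", "model_a")] * 1
)
scores = bradley_terry(verdicts)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A model that never wins a comparison ends up with a strength of zero here; a real version would probably want a small prior or pseudo-count, plus some handling for ties.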
I think it's more likely that we move away from benchmarks and toward more of a traditional reviewer model. People will find LLM influencers whose takes they agree with and follow them to keep up with new models.