Comment by zone411
2 days ago
I benchmarked it on four of my benchmarks so far. Got first place in two of them:
https://github.com/lechmazur/confabulations
https://github.com/lechmazur/nyt-connections
It seems like you often have LLMs grading each other. Aren’t you concerned that some models may not be “smart” enough to grade a smarter model appropriately?
Using LLMs for evaluating LLMs is incredibly common.
The point isn't having a "perfect" evaluator; it's having a cheap and reasonably consistent one.
This approach holds up well enough... as long as you don't try to use it for RL. If you do, chances are you'll end up with an adversarial LLM that optimizes solely for breaking and saturating the evaluator.
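For anyone who hasn't seen the pattern, here's a minimal sketch of LLM-as-judge grading, assuming the openai Python client. The judge model name, the rubric, and the example question are placeholders I made up, not anything from zone411's benchmarks:

```python
# Minimal LLM-as-judge sketch. Assumes the `openai` package (>=1.0) and an
# OPENAI_API_KEY in the environment. Model name and rubric are placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading another model's answer. Score it from 1 to 10 for "
    "factual accuracy and completeness. Reply with the integer score only."
)

def judge(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    """Ask the judge model to score `answer` against the rubric."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # reduces run-to-run variance; "somewhat consistent" at best
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    # A sketch-level parse; a real harness would validate the reply format.
    return int(resp.choices[0].message.content.strip())

# Grade one candidate answer (made-up example).
print(judge("What year did Apollo 11 land on the Moon?", "1969."))
```

In practice people also average several judge calls and randomize answer order to blunt position and verbosity biases. And as noted above, none of that survives a policy being trained directly against the judge.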
But I feel like the evaluator should generally be stronger/better than what it's evaluating. Otherwise you risk it grading at a lower level, while the stronger LLM writes with nuance the weaker one doesn't pick up on.
I've seen some places, e.g., the NY Times, use expert panels to review LLM output: for example, getting the author of a book or essay to evaluate how well the LLM summarizes and answers questions about it. It's not scalable, but it does seem better suited to evaluating cutting-edge models.
I’m not sure I would use “consistent” to characterize LLMs