Comment by LiamPowell
12 hours ago
This is not actually what the reviewer prompt says, or perhaps it is; I don't know, since they don't make it public. I'm just pointing out that it seems like a bad idea to ask an LLM to make a subjective judgement on things like "taste". If the SOTA LLM writing the code could not produce tasteful code, then why would a different LLM be able to judge the "taste" of that code?
Which LLM should we even use to judge taste? Is it giving an unfair advantage to Model X if we use Model X as the judge? Maybe we should use multiple models as judges, but now the model that's best at recognising and praising its own code has an advantage. The whole thing is just an unsolvable problem when an LLM is the judge.
> Is it giving an unfair advantage to Model X if we use Model X as the judge?
There have been studies showing that models tend to rate responses from their own family of models higher than equivalent responses from other families, e.g. gpt-4 would prefer a response from gpt-3