Comment by liamconnell
7 hours ago
> It is, but when a model/harness/tools/system prompts are the same/similar in the generator and reviewer fail in similar ways.
Is there empirical evidence for that? Where is it on an epistemic meter between (1) “it sounds good when I say it”, and (10) “someone ran evaluation and got significant support.”
“Vibes” (2/3 on scale) are ok, just honestly curious.
No comments yet
Contribute on Hacker News ↗