Comment by liamconnell

1 month ago

> It is, but when a model/harness/tools/system prompts are the same/similar in the generator and reviewer fail in similar ways.

Is there empirical evidence for that? Where is it on an epistemic meter between (1) “it sounds good when I say it”, and (10) “someone ran evaluation and got significant support.”