Comment by hustleracer

5 days ago

Really interesting approach to structured model comparison.

The debate round feature is the most compelling part — seeing which models change their position when exposed to other reasoning is more revealing than just the initial answer.

One thing I'd be curious to test: how consistently different models evaluate whether a given task aligns with a stated mission or vision. My intuition is there'd be wide variance, which would say something interesting about how reliable LLM-as-a-judge actually is for goal alignment scoring.