← Back to context Comment by wavemode 2 days ago He tests several Claude versions as well 1 comment wavemode Reply causal 2 days ago Ah you're right, scrolled past that - the most salient contrast in the chart is still just GPT-5 vs GPT-4, and it feels easy to contrive such results by pinning one model's response as "ideal" and making that a benchmark for everything else.
causal 2 days ago Ah you're right, scrolled past that - the most salient contrast in the chart is still just GPT-5 vs GPT-4, and it feels easy to contrive such results by pinning one model's response as "ideal" and making that a benchmark for everything else.
Ah you're right, scrolled past that - the most salient contrast in the chart is still just GPT-5 vs GPT-4, and it feels easy to contrive such results by pinning one model's response as "ideal" and making that a benchmark for everything else.