Comment by handsclean
3 hours ago
Please consider changing pass/fail to an integer score, maybe out of 5. This test is becoming more and more misleading as your apparent desire to give due credit conflicts with quality improvements over models that were already ok-ish. For example, on the Great Wave, Gemini 3’s excellent rendition gets no more credit than Qwen’s, which technically doesn’t fail if one is generous, and on cards, there’s no score distinction between results one could actually use and results one couldn’t.