Comment by vidarh
10 hours ago
Harness certainly matters a lot, though GLM is pretty forgiving. I just had Opus tell me that based on numbers over the last week, from quite a few billion tokens total across half a dozen providers, GLM 5.1 has been more reliable for one of my projects than Sonnet... Just switching on 5.2 now.
How are you collecting your metrics on token usage and reliability?
They are from my own runs, with reliability measured in terms of passing extensive test suites. So caveat is that this applies for my specific use and might well vary greatly.