Comment by naklitechie

14 hours ago

Looks like it's about a year behind. Not that I am complaining. A year behind is good progress.

I also feel much of the trick is in the reasoning and harness, so some progress around that would accelerate this process.

Harness certainly matters a lot, though GLM is pretty forgiving. I just had Opus tell me that based on numbers over the last week, from quite a few billion tokens total across half a dozen providers, GLM 5.1 has been more reliable for one of my projects than Sonnet... Just switching on 5.2 now.

  • How are you collecting your metrics on token usage and reliability?

    • They are from my own runs, with reliability measured in terms of passing extensive test suites. So the caveat is that this applies to my specific use and might well vary greatly.
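
A rough sketch of how a "passes the test suite" reliability number could be computed per model. This is not the commenter's actual tooling: the `pass_rates` helper, the `(model, passed)` record shape, and the placeholder model names and results are all assumptions made up for illustration.

```python
from collections import defaultdict


def pass_rates(runs):
    """Return {model: fraction of runs whose test suite passed}.

    `runs` is an iterable of (model, passed) pairs; this record shape
    is a hypothetical one chosen for the example.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, passed in runs:
        totals[model] += 1
        if passed:
            passes[model] += 1
    return {model: passes[model] / totals[model] for model in totals}


# Made-up records with placeholder model names, not real measurements.
sample_runs = [
    ("model-a", True),
    ("model-a", True),
    ("model-a", True),
    ("model-a", False),
    ("model-b", True),
    ("model-b", False),
]

rates = pass_rates(sample_runs)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.0%} of runs passed")
```

A number like this only reflects the test suites and tasks it was collected on, which is the same caveat given above.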

And what do you base this on?

How does one objectively quantify how it stacks up to another model?

Or even, what is your subjective evaluation based on?

I really wonder - because I have just finished a fully vibe-coded GTK/Rust/Lua application with me basically writing 7% of the code (all in one module) and GLM 5.1 writing the rest. We haven’t had regressions, confusion or anything else. And I am pretty damned sure I couldn’t have managed this one year ago with Claude Code and Sonnet.