Opus and Codex are both catching plenty of very good improvements to my GLM plans. GLM gets a lot right too, and it has plenty of good habits and practices. But in my experience it's not as smart, not as observant, and not as able to craft a nice system.
How do you qualify what makes a model "Mythos class", and how do you reliably test for it?
Presumably a deepswe benchmark, which IIRC puts GLM 5.2 between Opus 4.8 and fable.