Comment by alex7o
19 hours ago
OK, I find it funny that people compare models and say Opus 4.7 is SOTA and much better, etc. I have used GLM 5.1 (I assume this comes from them training on both Opus and Codex) for things Opus couldn't do and have seen it write better code. I haven't tried the Qwen Max series, but I have seen the local 122B model do smarter, more correct things based on docs than Opus. So yes, benchmarks are one thing, but reality is what the models actually do, and you should learn the real strengths that each model possesses. It is a tool in the end: you shouldn't say a hammer is better than a wrench, even though both could drive a nail into a piece of wood.
GLM 5.1 was the model that made me feel like the Chinese models had truly caught up. I cancelled my Claude Max subscription and genuinely have not missed it at all.
Some people seem to agree and some don't, but I think that indicates we're just down to your specific domain and usage patterns rather than the SOTA models being objectively better like they clearly used to be.
It seems like people can't even agree which SOTA model is best at any given moment anymore, so yeah I think it's just subjective at this point.
Perhaps not even necessarily subjective, just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.
And the subjectivity is bidirectional.
People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs and explains why people have wildly different experiences with the same model.
AI is a complete commodity
One model can replace another at any given moment in time.
It's NOT a winner-takes-all industry
and hence none of the lofty valuations make sense.
the AI bubble burst will be epic and make us all poorer. Yay
Hmm
Will try it out. Thanks for sharing!
What is your workflow? Do you use Cursor or another tool for code Gen?
I use Opencode, both directly and through Discord via a little bridge called Kimaki.
https://github.com/remorses/kimaki
The value in Claude Code is its harness. I've tried the desktop app and found it was absolutely terrible in comparison. Like, the very nature of it being a separate codebase is already enough to completely throw off its performance compared to the CLI. Nuts.
> The value in Claude Code is its harness
If this was the case then Anthropic would be in a very bad spot.
It's not, which is why people got so mad about being forced to use it rather than better third party harnesses.
Pi is better than CC as a harness in almost every respect.
I thought the desktop app used the cli app in the background?
I have been using GLM-5.1 with pi.dev through Ollama Cloud for my personal projects and I am very happy with this setup. I use pi.dev with Claude Sonnet/Opus 4.6 at work. Claude Code is great, but the latest update has me compacting so much more frequently that I could not stand it. I don't miss MCP tool calling when I am using pi.dev; it uses APIs just fine. I actually think GLM-5.1 builds better websites than Claude Opus. For my personal projects I am building a full stack development platform and GLM-5.1 is doing a fantastic job.
I'm using pi the same as you. However, I have an MCP I need to use and the popular extension for that support works fine for me.
Really liking pi and glm 5.1!
Why use ollama cloud versus like Openrouter?
The limits seem higher on Ollama Cloud to me than paying for API access. I don't have solid stats on that though. I have an OpenRouter account and the service I am creating is going to need to use that. I will have a better measuring stick then.
Recently it had great limits but this month I'm trying open router directly.
The only reason I'm stuck with Claude and Chatgpt is because of their tool calling. They do have some pretty useful features like skills etc. I've tried using qwen and deepseek but they can't even output documents. How are you guys handling documents and excels with these tools? I'd love to switch tbh.
> I've tried using qwen and deepseek but they can't even output documents
What agent harness did you use? Usually, "write_file" and "shell_exec" (or similar) are two of the first tools you add to an agent harness, after read_file/list_files. If it doesn't have those tools, I'm unsure you could even call it an agent harness in the first place.
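For readers unfamiliar with the tools named above, here is a minimal sketch of how a harness might expose them. The four tool names come from the comment; the `TOOLS` registry, the `dispatch` helper, and the argument shapes are assumptions made for illustration, not any particular harness's API.

```python
import subprocess
from pathlib import Path


def read_file(path: str) -> str:
    """Return the text content of a file."""
    return Path(path).read_text(encoding="utf-8")


def list_files(path: str = ".") -> list[str]:
    """Return the sorted entry names in a directory."""
    return sorted(p.name for p in Path(path).iterdir())


def write_file(path: str, content: str) -> str:
    """Write text to a file, creating parent directories as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return f"wrote {len(content)} characters to {path}"


def shell_exec(command: str, timeout: int = 30) -> str:
    """Run a shell command and return stdout followed by stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


# Hypothetical registry: maps the tool name the model emits to a callable.
TOOLS = {
    "read_file": read_file,
    "list_files": list_files,
    "write_file": write_file,
    "shell_exec": shell_exec,
}


def dispatch(name: str, arguments: dict) -> str:
    """Execute one tool call requested by the model and return a string result."""
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    return str(TOOLS[name](**arguments))
```

A real harness would also sandbox `shell_exec` and restrict which paths `write_file` may touch; that is left out here to keep the sketch short.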
Sorry for the confusion, I was actually talking about their Web based chat. Since most of my work is governance and docs, I just use their Web chats and they just refuse to output proper documents like Claude or Chatgpt do.
You can make a harness fully functional with just the "shell_exec" tool if you give it access to a linux/unix environment + playwright cli.
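A rough sketch of that single-tool idea, assuming a POSIX `sh` is available: reading, listing, and writing files all reduce to command strings passed through one `shell_exec` function. The name `shell_exec` follows the comment; the `*_cmd` helpers are hypothetical and only show what the model would have to type itself. The Playwright CLI part is not covered here.

```python
import shlex
import subprocess


def shell_exec(command: str, timeout: int = 30) -> str:
    """The only tool: run a command with sh and return stdout followed by stderr."""
    result = subprocess.run(
        ["sh", "-c", command], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


# What dedicated tools would do, expressed as plain shell commands.
def read_cmd(path: str) -> str:
    return f"cat {shlex.quote(path)}"


def list_cmd(path: str) -> str:
    return f"ls -1 {shlex.quote(path)}"


def write_cmd(path: str, content: str) -> str:
    # printf %s writes the content verbatim, without adding a trailing newline.
    return f"printf %s {shlex.quote(content)} > {shlex.quote(path)}"
```

The trade-off is that quoting and escaping become the model's problem, which is one reason harnesses tend to add dedicated file tools anyway.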
When was the last time you used Qwen models? Their 3.5 and 3.6 models are excellent with tool calling.
I gave it a try a few weeks ago tbh, I'll give it another shot tho. I mainly use their Web chats since that's easier to use and previously, qwen, deepseek, kimi, all were unable to output proper docx files or use skills.
You can use GLM-5.1 with Claude Code directly. I use ccs, with GLM-5.1 set up as the plan, but it goes via API key.
You can just use Cline in VSCode to get most of the tooling you need - it works with all models. Including Xiaomi's new Mimo with 1m context window and blazing fast speed. It's much cheaper than Claude's biggest plan and with much, much more quota.
Yep Claude Code CLI does A LOT (which is now confirmed even more)
qwen3.5 and qwen3.6 are both good at tool calling.
I've been using qwen-code (the software, not to be confused with Qwen Code the service or Qwen Coder the model) which is a fork of gemini-cli and the tool use with Qwen models at least has been great.
You can use both codex and Claude CLI with local models. I used codex with Gemma4 and it did pretty well. I did get one weird session where the model got confused and couldn't decide which tools actually existed in its inventory, but usually it could use tools just fine.
I wonder why glm is viewed so positively.
Every time I try to build something with it, the output is worse than other models I use (Gemini, Claude), it takes longer to reach an answer and plenty of times it gets stuck in a loop.
I've been running Opus and GLM side-by side for a couple weeks now, and I've been impressed with GLM. I will absolutely agree that it's slow, but if you let it cook, it can be really impressive and absolutely on the level of Opus. Keep in mind, I don't really use AI to build entire services, I'm mostly using it to make small changes or help me find bugs, so the slowness doesn't bother me. Maybe if I set it to make a whole web app and it took 2 days, that would be different.
The big kicker for GLM for me is I can use it in Pi, or whatever harness I like. Even if it was _slightly_ below Opus, and even though it's slower, I prefer it. Maybe Mythos will change everything, but who knows.
> The big kicker for GLM for me is I can use it in Pi, or whatever harness I like.
Yes, but... isn't the same true for Opus and all the other models too?
I have used GLM 4.7, 5 and 5.1 for about 3 months now via the OpenCode harness and I don't remember it ever being stuck in a loop.
You have to keep it below ~100,000 tokens, else it gets funny in the head.
I only use it for hobby projects though. Paid 3 EUR per month, which is no longer available though :( Not sure what I will choose at the end of the month. Maybe OpenCode Go.
EDIT: OK, now I tried GLM for the first time in the morning CET, and it was... bad. The reasoning took 5 minutes for a very, very small .html file, going around in circles.
Evening CET experience for me is super smooth.
That's unfortunate. 70-80k tokens is roughly the point where I finish giving the agent the required context, even on small to medium-sized requests.
That would leave almost no tokens for actual work
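One way to act on the ceiling discussed above is to have the harness track a rough token budget and drop old turns before crossing it. This is a sketch only: the 4-characters-per-token estimate is a crude assumption (real tokenizers differ), and `estimate_tokens`, `trim_history`, and the message format are hypothetical names, not part of OpenCode or any GLM API.

```python
CHARS_PER_TOKEN = 4  # Assumption: crude average; use the model's tokenizer for real counts.


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def total_tokens(messages: list[dict]) -> int:
    """Sum the estimated tokens over a list of {'role', 'content'} messages."""
    return sum(estimate_tokens(m["content"]) for m in messages)


def trim_history(messages: list[dict], budget: int = 100_000) -> list[dict]:
    """Drop the oldest non-system messages until the estimate fits the budget.

    Assumes system messages sit at the start of the conversation.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and total_tokens(system) + total_tokens(rest) > budget:
        rest.pop(0)  # the oldest turn goes first
    return system + rest
```

If the required context alone is 70-80k tokens, as the reply says, a 100,000-token budget leaves roughly 20-30k for the actual work, so summarizing old turns rather than dropping them may be the better fit.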
GLM is the first open source model that actually worked for me, where I found the output ok.
And yes, sonnet/opus is better and what I use daily. But I wouldn’t be that upset if I had to drop down to GLM.
IDK about GLM but GPT 5.4 Extra High has been great when I've used it in the VS Code Copilot extension, I see no actual reason Opus should consume 3x more quota than it the way it does
I think it offers a very good tradeoff of cost vs competency
4.7 is better, but it's also wildly expensive
You're probably just holding it wrong.
Opus 4.6 was incredible but Opus 4.7 is genuinely frustrating to me so far. It's really sharp but can be so lazy. It's constantly telling me that we should save this for tomorrow, that it's time for bed (in the middle of the day), and very often quite sloppy and bold in its action. These adjustments are getting old. The next crop of open models seems ready to practically replace the big ones as sharp orchestrator agents.
I have never seen a model be “lazy” before (I have seen them go for minimal change). I have been using the models through the api with various agents and no custom system prompt.
So I am curious, how do people get these lazy outputs?
Is it by having one of those custom system prompts that basically tells the model to be disrespectful?
Or is it free tier?
Cheap plans?
I have seen some people complain about a new tendency where it can suggest wrapping up the current task even though it isn't done yet. I haven't seen it myself though.
The models test roughly equal on benchmarks, with generally small differences in their scores. So, it’s reasonable to choose the model based on other criteria. In my case, I’d switch to any vendor that had a decent plugin for JetBrains.
Qwen3-Coder produced much better rust code (that utilized rust's x86-64 vectorized extensions) a few months ago than Claude Opus or Google Gemini could. I was calling it from harnesses such as the Zed editor and trae CLI.
I was very impressed.
I think Claude, in general, writes very lazy, poor quality code, but it writes code that works in fewer iterations. This could be one of the reasons behind its popularity: it pushes towards the end faster at all costs.
Every time codex reviews claude written rust, I can't explain it, but it almost feels like codex wants to scream at whoever wrote it.
Their latest, Qwen3.6 35B-A3B is quite capable, and fast and small enough I don't really feel constrained running it locally. Some of the others that I've run that seem reasonably good, like Gemma 4 31B and Qwen3.5 122B-A10B just feel a bit too slow, or OOM my system too often, or run up on cache limits so spend a lot of time re-processing history. But the latest Qwen3.6 is both quite strong, and lightweight enough that it feels usable on consumer hardware.
Codex is pretty good at Rust with x86 and arm intrinsics too, it replaced a bunch of hand written C/assembly code I was using. I will try Qwen and Kimi on this kind of task too.
Consider that SWE benchmarking is mainly done with Python code. That tells you something.
I tried GLM and Qwen for a day last week. Some issues it could solve, but one task that looked relatively easy on the surface it just could not solve after a few tries, and Opus one-shotted it this morning with the same prompt. It's a single example, of course, but I really wanted to give it a fair try. All it had to do was create a sortable list in Magento admin. On the other hand, GLM did one-shot a PhpStorm plugin.
Do you use Opus through the API or with subscription? Did you use OpenCode or Code?
Opus through Claude Code, the Chinese models through OpenCode Go, which seems like a great package to test them out.
If you showed me code from GLM 5.1, Opus 4.6, and Kimi K2.6, my ranking for best model would be highly random.
Not to mention that Opus costs orders of magnitude more money. These are VERY impressive and usable.
FAANGS love to give away money to get people addicted to their platforms, and even they, the richest companies in the world, are throttling or reducing Opus usage for paying members, because even the money we pay them doesn't cover it.
Meanwhile, these are usable on local deployments! (and that's with the limited allowance our AI overlords afford us when it comes to choices for graphics cards too!)
I tried GLM5.1 last week after reading about it here. It was slow as molasses for routine tasks and I had to switch back to Claude. It also ran out of 5H credit limit faster than Claude.
If you view the "thinking" traces you can see why; it will go back and forth on potential solutions, writing full implementations in the thinking block then debating them, constantly circling back to points it raised earlier, and starting every other paragraph with "Actually…" or "But wait!"
I see this with Opus too.
> "Actually…" or "But wait!"
You’re absolutely right!
Jokes apart, I did notice GLM doing these back and forth loops.
Z.ai’s cloud offering is poor, try it with a different provider.
could you add some context for why you think it's poor?
Benchmarking is grossly misleading. Claude's subscription with Code would not score this high on the benchmarks because of how they lobotomized agentic coding.
>but I have seen the local 122b model do smarter more correct things based on docs than opus
Could you please share more about this
Maybe a bit misleading. I have used it in two places.
One is local OpenCode coding and config of stuff; the other is agent-browser use. For both, it did better than Opus 4.6 on the thing I was testing at the moment. The problem with Opus at the time I tried it was overthinking and sometimes moving itself in the wrong direction (not that Qwen doesn't overthink sometimes). However, sometimes less is more: maybe turning thinking down on Opus would have helped me. Some people said it is better to turn it off entirely when you start to implement code, as it already knows what it needs to do and doesn't need more distraction.
Another example is my Ghostty config: I learned from Qwen that it has theme support. Opus would always just make the theme in the main file.
Many people have turned away from religion (which I can get behind), but have never removed the dogmatic thinking that lay at its root.
As so many things these days: It's a cult.
I've used Claude for many months now. Since February I see a stark decline in the work I do with it.
I've also tried to use it for GPU programming, where it absolutely sucks, with Sonnet, Opus 4.5 and 4.6.
But if you share that sentiment, it's always a "You're just holding it wrong" or "The next model will surely solve this"
For me it's just a tool, so I shrug.
> I've used Claude for many months now. Since February I see a stark decline in the work I do with it.
I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.
I wonder about this. I see two obvious possibilities (if we ignore bias):
1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.
2. You are relying more and more on the models and are using your talent less and less. What you are observing is the ratio of your vs. the model's work leaning more and more to the model's. When a new model is released, it produces better quality code than before, so the work improves with it, but your talent keeps deteriorating at a constant rate.
I think it might have to do with how models work, and fundamental limits with them (yes, they're stochastic parrots, yes they confabulate).
Newer (past two years?) models have improved "in detail" - or as pragmatic tools - but they still don't deserve the anthropomorphism we subject them to because they appear to communicate like us (and therefore appear to think and reason, like us).
But the "holes" are painted over in contemporary models - via training, system prompts and various clever (useful!) techniques.
But I think this leads us to have great difficulty spotting the weak spots in a new, or slightly different model - but as we get to know each particular tool - each model - we get better at spotting the holes on that model.
Maybe it's poorly chosen variable names. A tendency to write plausible-looking, plausibly named e2e tests that turn out to not quite test what they appear to test at first glance. Maybe there's missing locking of resources, or missing use of transactions, in sequential code that appears sound - but ends up storing invalid data when one or several steps fail...
In happy cases current LLMs function like well-intentioned junior coders enthusiastically delivering features and fixing bugs.
But in the other cases, they are like pathologically lying sociopaths telling you anything you want to hear, just so you keep paying them money.
When you catch them lying, it feels a bit like a betrayal. But the parrot is just tapping the bell, so you'll keep feeding it peanuts.
I agree - the problem is it’s hard to see how people who say they’re using it effectively actually are using it, what they’re outputting, and making any sort of comparison on quality or maintainability or coherence.
In the same way, it’s hard to see how people who say they’re struggling are actually using it.
There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.
Well summarized.
We're also seeing that the people up top are using this to cull the herd.
What is it that is dogma free? If one goes hardcore pyrrhonism, doubting that there is anything currently doubting as this statement is processed somehow, that is perfectly sound.
At some point there is a need to have faith in some stable enough ground to be able to walk on.
Who controls that need for you?
All people think dogmatically. The only difference is what the ontological commitments and metaphysical foundations are. Take out God and people will fit politics, sports teams, tools, whatever in there. It's inescapable.
All people think dogmatically, but religion does not prevent people from acting dogmatically in politics, sports, etc. It just doesn't. It never did.
Under normal circumstances I'd consider this a nit and decline to pick it, but the number of evangelists out there arguing the equivalent of "cure your alcohol addiction with crystal meth!" is too damn high.
Allow me to introduce you to Buddhism
Dogmatism is a spectrum and for too many people it's on the animal side of the scale.
I wonder to what degree it depends on how easy you find coding in general. I find for the early steps genAI is great to get the ball rolling, but rapidly it becomes more work to explain what it did wrong and how to fix it (and repeat until it does so) than to just fix the code myself.
Yes, this and also taste. What might be perfectly fine for one developer is an abomination for another who can spot the problems with it.
I think in every domain, the better you are the less useful you find AI.