
Comment by reissbaker

12 hours ago

I've been playing around with GLM-4.5 as a coding model for a while now and it's really, really good. In the coding agent I've been working on, Octofriend [1], I've sometimes had it switched on and mistaken it for Claude 4. Subjectively, my experience has been:

1. Claude is somewhat better at whole-codebase tasks, where you need to reason over a bunch of context and consider system interactions.

2. GLM-4.5 is somewhat better at being "honest" — i.e. I rarely see it doing the things Claude does like making broken tests pass by changing the test instead of fixing the bug.

Both are quite good though, and GLM-4.5 has found bugs that both Claude 4 Sonnet and 4.1 Opus have failed to catch. In general I think Claude wins a little more frequently on debugging tasks than GLM-4.5, but it's close.

Compared to GPT-5, both Claude and GLM feel more consistent, although GPT-5 sometimes has long, brilliant runs where it nails everything with subjectively higher code quality than either of them. However, once GPT-5 goes off the rails, it's hard to get it back on track, so it can be a bit frustrating to work with in comparison.

1: https://github.com/synthetic-lab/octofriend

I just read your comment and decided to give GLM-4.5 a try in Kilocode. I'd been using Gemini CLI all day to try to resolve a tricky bug in some compiler code (a compiler for a subset of C that generates microcode for... a weird architecture, I'll leave it at that). GLM-4.5 zoomed in on the problem right away, a problem that had eluded Gemini CLI all day. Gemini had been leading me on a wild goose chase, implicating a function that turned out not to be the problem, and making all kinds of lame changes to it while claiming they would fix the problem. They never did, because the problem wasn't in that function.

  • Sometimes getting a second pair of eyes on a problem helps, and it's usually not a judgment on the smartness of the first pair of eyes. Seems like that applies to coding agents too.

    • Indeed, I've also found that various models are good at various tasks, but I haven't yet been able to categorize "model X is good at Y-class of bugs", so I end up using N models for a first pass of "find the root cause of this issue", then once it's found, I pass it along to the same N models to attempt a fix.

      So far, which model can find/solve what is really scattered all over the place.


  • Gemini CLI uses the whole-file edit format and burns through tokens very fast. I use aider for this reason, with the diff-fenced edit format; it burns far fewer tokens (rough invocation sketched below).
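
    A minimal sketch, assuming aider pointed at a Gemini model (the model name is just an example; the relevant setting is --edit-format):

      # Use the diff-fenced edit format so the model emits fenced diffs
      # instead of rewriting whole files (model name is only an example)
      aider --model gemini/gemini-2.5-pro --edit-format diff-fenced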

  • I'm curious about your setup. Is it just Gemini CLI, or are you combining it with other frameworks?

I've had similarly good experiences with GLM-4.5 for smaller projects/requests. Unfortunately that did degrade with larger contexts, so I'm still treating it as a good fallback for Sonnet 4, rather than a full-blown replacement.

I've been using architect mode in aider:

DeepSeek R1 (does the high-level planning) combined with Qwen3 480B (does the low-level coding), or whatever is available from the Qwen Code APIs.

It's working great.

It solves 99.99% of problems on its own.

The separation isn't very good in aider, so I plan to build my own tool later to get a better workflow.
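
If anyone wants to reproduce the split, here's a rough sketch of the aider invocation; the model strings are placeholders for whatever your providers expose:

    # Architect mode: one model plans the change, a second model writes the edits.
    # Model names below are placeholders; adjust them for your provider.
    aider --architect \
          --model deepseek/deepseek-reasoner \
          --editor-model openrouter/qwen/qwen3-coder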

  • What’s your monthly bill (OpenRouter?), if I may ask? I have Claude Max and I'm always on the lookout for alternatives, at least for the easier-to-solve problems.

About your first point: I also feel like Claude is better when there’s more in the context, whereas GLM-4.5 gets "worse".

  • Claude used to be better; not anymore, or at least the difference isn't that big.

    DeepSeek R1 + Qwen3 is close enough, along with Gemini 2.5 Pro,

    so I don't see any point in Claude anymore.

How are you using GLM-4.5? Are you consuming the API, or running something like GLM-4.5-Air locally?

  • I run a privacy-focused inference company, Synthetic [1], and I use our API of course :P I actually like GLM-4.5 enough that it's currently our default recommended model for new users. But yes, otherwise I'd use the official zai API most likely, or Fireworks. GLM-4.5-Air is quite good for a local model but GLM-4.5 is better; up to you if the tradeoff is worth it — there's definitely value in the data not ever leaving your machine, but it's not going to be as strong of a model.

    1: https://synthetic.new

    • I’m curious about your service: if it’s centered around privacy, why is the data stored for 14 days at all? My understanding with Fireworks is that it’s zero logging, nothing to store. To me that’s private.


  • Not OP. Chutes.ai charges $0.20 per 1M tokens. I don’t think it uses caching, though, because I ended up burning $30 in an hour or two; that’s about 150M tokens at that rate. I had to move back to Claude Code.