Comment by mikenew

12 hours ago

GLM 5.1 was the model that made me feel like the Chinese models had truly caught up. I cancelled my Claude Max subscription and genuinely have not missed it at all.

Some people seem to agree and some don't, but I think that just means it comes down to your specific domain and usage patterns, rather than one SOTA model being objectively better the way it clearly used to be.

It seems like people can't even agree on which SOTA model is best at any given moment anymore, so yeah, I think it's just subjective at this point.

  • Perhaps it's not even subjective: performance is highly task-dependent and even variable within tasks. People get objectively different experiences, assume one model or another is better, but it's basically random.

    • Unless you're looking at something like a pass@100 benchmark, the benchmarks are heavily confounded by the likelihood of a "golden path" retrieval landing within the model's capabilities. That's on top of uncertainties like how well your task within a domain maps to the relevant test sets, plus factors like context fullness and context complexity (a heavy list of relevant, complex instructions weighs on capabilities differently than, e.g., a history with prior unrelated tasks still in context).

      The best tests are your own custom, personal-task-relevant standardized tests, built so that the best models can't saturate them (aim for less than a 70% pass rate even in the best case).

      All this is to say that most people are not doing the latter, and their vibes are confounded to the point of being mostly meaningless.
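      For reference, pass@k metrics like the pass@100 mentioned above are commonly computed with the standard unbiased estimator: sample n attempts per task, count the c that pass, and estimate the probability that at least one of k draws (without replacement) would pass. A minimal sketch, not tied to any particular benchmark harness:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total sampled attempts for a task
    c: number of those attempts that passed
    k: number of draws we imagine taking

    Returns the probability that at least one of k draws
    (without replacement) from the n attempts is a pass,
    i.e. 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failures exist, so any k draws must
        # include at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

      For a personal eval suite, averaging `pass_at_k(n, c, 1)` over your own tasks gives the plain pass rate you would compare against that sub-70% target.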

    • >just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.

      You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. Certain models are probably objectively better at certain tasks; it's just that our ability to know which ones is currently impaired.

  • And the subjectivity is bidirectional.

    People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs, which explains why people have wildly different experiences with the same model.

  • AI is a complete commodity: one model can replace another at any given moment in time.

    It's NOT a winner-takes-all industry, and hence none of the lofty valuations make sense.

    The AI bubble burst will be epic and make us all poorer. Yay.

    • Staying power is probably the most important factor, which is why I'm thinking Google eventually takes the crown.

The value in Claude Code is its harness. I've tried the desktop app and found it absolutely terrible in comparison. Like, the very nature of it being a separate codebase is enough to completely throw off its performance relative to the CLI. Nuts.

  • > The value in Claude Code is its harness

    If that were the case, then Anthropic would be in a very bad spot.

    It's not, which is why people got so mad about being forced to use it rather than better third-party harnesses.

    Pi is better than CC as a harness in almost every respect.