Comment by amunozo

9 hours ago

A bit skeptical about a 27B model comparable to opus...

For at least a year now, it has been clear that data quality and fine-tuning are the main sources of improvement for medium-sized models. Size != quality for specialized, narrow use cases such as coding.

It’s not a surprise that models are leapfrogging each other when the engineers are able to incorporate better code examples and reasoning traces, which in turn bring higher quality outputs.

  • If all you're looking at is benchmarks that might be true, but those are way too easy to game. Try using this model alongside Opus for some work in Rust/C++ and it'll be night and day. You really can't compare a model that's got trillions of parameters to a 27B one.

    • > ...and it'll be night and day.

      That's just, like, your opinion, man.

      > You really can't compare a model that's got trillions of parameters to a 27B one.

      Parameter count doesn't matter much when coding. You don't need in-depth general knowledge or multilingual support in a coding model.

      1 reply →

From what I understand, ~30b is enough "intelligence" to make coding/reasoning etc. work, in general. Above ~30b, it's less about intelligence, and more about memorization. Larger models fail less and one-shot more often because they can memorize more APIs (documentation, examples, etc). Also from my experience, if a task is ambiguous, Sonnet has a better "intuition" of what my intent is. Probably also because of memorization, it has "access" to more repositories in its compressed knowledge to infer my intent more accurately.

You should try it out. I'm incredibly impressed with Qwen 3.5 27B for systems programming work. I use Opus and Sonnet at work and Qwen 3.x at home for fun, and I barely notice a difference, given that systems programming currently needs careful guidance for any model. I don't try to one-shot landing pages or whatever.

  • Are you using the same agent/harness/whatever for both Claude and Qwen, or something different for each one?

You should be skeptical. Benchmark racing is the current meta game in open weight LLMs.

Every release is accompanied by claims of being as good as Sonnet or Opus, but when I try them (even hosted full weights) they’re far from it.

Impressive for the size, though!

Opus 4.5, mind you, but I'm not too surprised given how good 3.5 was and how good the qwopus fine-tune was. The model was shown to benefit heavily from further RL.

Some of these benchmarks are supposedly easy to game. Which ones should we pay attention to?

  • SWE-REbench should not be gameable. They collect new issues from live repos, so if you check 1-2 months after a model was released, you can get an idea. But even that would be "benchmaxxable", an overloaded term that can mean many things; the most vanilla interpretation is that with RL you can get a model to follow a certain task pretty well, but it'll get "stuck" on that task type, or "stubborn" when asked similar but sufficiently different tasks. So for SWE-REbench that would be "it fixes bugs in these types of repos, under this harness, but ask it to do something else in a repo and you might not get the same results". In a nutshell.

  • Well, your own unleaked ones, representing your real workloads.

    If you can't afford to do that, look at a lot of them. E.g., on artificialanalysis.com they merge multiple benchmarks across weighted categories to build an Intelligence Score, a Coding Score, and an Agentic score.

  • None. Try them out with your own typical tasks to see the performance.

  • ARC-AGI 2

    GLM 5 scores 5% on the semi-private set, compared to SOTA models which hover around 80%.

You'd be surprised how good small models have gotten. Size of the model isn't all that matters.

A small model can be made to be "comparable to Opus" in some narrow domains, and that's what they've done here.

But when actually employed to write code they will fall over when they leave that specific domain.

Basically, they might have skill but lack wisdom. Certainly at this size they lack anywhere near the same contextual knowledge.

Still, these things could be useful in the context of more specialized tooling, or in a harness that heavily prompts in the right direction, or as a subagent for a "wiser" larger model that directs all the planning and reviews the results.