Comment by starchild3001
14 hours ago
Really appreciate the depth of this paper; it's a welcome change from the usual model announcement blog posts. The Zhipu/Tsinghua team laid out not just the 'what' but the 'how,' which is where the most interesting details are for anyone trying to build with or on top of these models.
The post-training methodology (Sec 3) is what really stands out to me. The idea of creating specialized 'expert models' for reasoning, agents, and chat, and then distilling their capabilities into a final unified model is a fascinating approach. It feels like a more structured way to solve the "jack of all trades, master of none" problem that can plague generalist models. Instead of just mixing all the data, they're essentially having a generalist learn from a committee of specialists.
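For intuition, here's roughly what that "committee of specialists" could look like as a training objective. This is a minimal sketch of standard multi-teacher distillation, not the paper's actual recipe; the domain routing and the HF-style model interface (`output.logits`) are my assumptions:

```python
import torch
import torch.nn.functional as F

def distill_step(student, experts, batch, temperature=2.0):
    """One distillation step: route each example to its domain expert and
    train the unified student to match that expert's token distribution.

    experts: dict mapping a domain tag ("reasoning", "agent", "chat")
    to a frozen teacher model. Models are assumed HF-style (output.logits).
    """
    inputs, domains = batch["input_ids"], batch["domains"]
    student_logits = student(inputs).logits  # (batch, seq, vocab)

    loss = 0.0
    for i, domain in enumerate(domains):
        with torch.no_grad():
            teacher_logits = experts[domain](inputs[i : i + 1]).logits
        # Standard soft-label KL distillation at temperature T,
        # scaled by T^2 to keep gradient magnitudes comparable.
        loss = loss + F.kl_div(
            F.log_softmax(student_logits[i : i + 1] / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature**2
    return loss / len(domains)
```

Distillation can also happen at the data level (fine-tuning the student on expert-generated outputs), which may be closer to what the paper actually does; the logit-matching version above is just the most compact way to show the routing idea.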
A couple of the findings from their RL experiments are pure gold for anyone working in this space. The counter-intuitive result that a single-stage RL process at the full 64K context length outperforms a progressive, multi-stage approach (Fig 6) is a fantastic lesson; I've seen teams assume the opposite would be true. Also, the pragmatic choice to use an XML-like template for function calls to avoid JSON escaping hell (Fig 4) is a small but brilliant engineering decision that makes a huge difference in practice. Wrangling escaped code inside JSON turns out to be a mess.
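To make the escaping point concrete, here's a toy comparison. The tag names are placeholders I made up, not the exact template from Fig 4:

```python
import json

# A snippet the model wants a tool to execute; it contains quotes,
# a backslash escape, and a real newline.
code = 'text = "line1\\nline2"\nprint(text)'

# JSON-style call: the payload has to survive a second layer of escaping,
# which is exactly where models tend to slip.
json_call = json.dumps({"name": "run_python", "arguments": {"code": code}})
print(json_call)
# {"name": "run_python", "arguments": {"code": "text = \"line1\\nline2\"\nprint(text)"}}

# XML-like template: the payload is emitted verbatim between tags,
# so the model never has to produce correctly escaped strings-within-strings.
xml_call = f"""<tool_call>
<name>run_python</name>
<code>
{code}
</code>
</tool_call>"""
print(xml_call)
```

Note how the JSON version mixes three different escape layers (`\"`, `\\n`, `\n`) in one string, while the XML-like version passes the code through untouched.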
The performance on SWE-bench is impressive, putting it in the same league as much larger or proprietary models. What I’d love to see, and maybe others here have thoughts, is whether this hybrid training recipe holds up outside ARC-style evals. For example, do the agentic improvements transfer to messier, real-world workflows where APIs are undocumented, partial failures are common, and user input is full of ambiguity?
Are all these "post/mid-training tweaks" important if you have a specific domain with abundant/verified/synthetic data and labels?
Can a small team working on ASI or domain-specific models stick to scaling a 2024-era best-practices training stack, or will they miss out on massive improvements?
[flagged]
I see your points, but is this actually slop in this case? Is the comment incorrect or misleading at all?
It felt interesting and informative to me, but I didn’t verify any of it.
Good eye btw.
The comment you're replying to is 100% AI-generated. How does obviously LLM-generated content continually make it to the front of HN, and why in God's name are you being downvoted for calling this out??
"...a fascinating approach..." (LLMs think everything is fascinating)
"...they're essentially having a generalist learn from a committee of specialists..." (analogies, analogies)
"...where APIs are undocumented, partial failures are common, and user input is full of ambiguity..." (typical AI rule of three template with semantically similar parameters that contribute nothing to the overall meaning)
It does worry me how defensive people can become over really obvious slop. I don't think I'm even particularly attuned to the style of LLM writing, but it's incredibly obvious every time I see it. It's only going to get worse, I think.
> and why in God's name are you being downvoted for calling this out??
Tinfoil hat time, but perhaps the bots don't like being called out? I don't actually take that statement seriously, but it seems like an eventual avenue. They've long been seeding threads on Reddit to shape the initial hive mind; I imagine that's going to get more advanced and widespread.
> ...is what really stands out to me. The idea of...
> ...are pure gold for anyone working in this space...
Specifically OpenAI
It doesn't matter whether the cat is black or white; as long as it catches mice, it's a good cat.
You did call it out.
If you read my comment closely, I didn't deny calling anyone out.