
Comment by polyrand

2 days ago

A few comments mentioning distillation. If you use claude-code with the z.ai coding plan, I think it quickly becomes obvious they did train on other models. Even the "you're absolutely right" was there. But that's ok. The price/performance ratio is unmatched.

I had Gemini 3 Flash hit me this morning with "you're absolutely right" when I corrected it on a mistake it made. It's not conclusive of anything.

  • That's interesting, thanks for sharing!

    It's a pattern I saw more often with claude code, at least in how frequently it said it (much improved now). But it's true that this pattern alone is not enough to infer the training method.

I imagine - and sure hope - that everyone trains on everything else. Distillation: of course, if one has bigger/other models providing true posterior token probabilities in the (0, 1) interval, rather than one-hot targets over a ~200K-token vocabulary that are 0 for every token except the desired output token, one should use the former instead of the latter.

It's amazing that such a simple and straightforward idea faced so much resistance (the paper was rejected), from the community supposedly most open-minded and devoted to knowing (academia), and on the wrong grounds ('will have no impact on industry'; in fact it has had tremendous impact on industry; a better rejection would have been 'duh, it is obvious'). We are not trying to torture the model and the GPU cluster into learning from zero when the knowledge is already available. :-)
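
To make the contrast concrete, here's a minimal sketch (just an illustration, not anyone's actual training code; the vocabulary size, temperature, and mixing weight alpha are made-up numbers) comparing a one-hot cross-entropy loss with a KL-divergence distillation loss against a teacher's soft distribution:

```python
# Minimal illustrative sketch: one-hot cross-entropy vs. distillation against a
# teacher's full next-token distribution. All constants below are arbitrary
# example values, not anyone's real training configuration.
import torch
import torch.nn.functional as F

vocab_size = 200_000          # e.g. a ~200K-token vocabulary
temperature = 2.0             # softens both distributions before comparing them
alpha = 0.5                   # blend between hard-label loss and distillation loss

student_logits = torch.randn(1, vocab_size)   # stand-in for the student's output
teacher_logits = torch.randn(1, vocab_size)   # stand-in for the teacher's output
hard_target = torch.tensor([42])              # the single "correct" token id

# Standard objective: 1-hot target, 0 for every token except the desired one.
hard_loss = F.cross_entropy(student_logits, hard_target)

# Distillation objective: match the teacher's posterior over all tokens,
# which carries graded information in (0, 1) about every alternative token.
soft_student = F.log_softmax(student_logits / temperature, dim=-1)
soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
distill_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2

loss = alpha * hard_loss + (1 - alpha) * distill_loss
```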

> Even the "you're absolutely right" was there.

I don't think that's particularly conclusive evidence of training on other models. It seems plausible to me that the internet data corpus simply converges on this phrasing, hence multiple models doing it.

...or not...hard to tell either way.