
Comment by nl

4 days ago

> The analogy falls apart very quickly. Without the training data, your modifications amount to virtually nothing compared to what these "versions" are, and the idea that you can maintain and improve on these models without the continual support of the company that owns the training data AND harnesses AND in general build instructions is not very credible.

This is completely wrong, and sort of shows why what you are saying is not a problem at all.

You can post-train any LLM very easily without access to the original training data.

People do it all the time.

Cursor post-training Kimi K2 is a great example.

> If Qwen decides to stop distributing models for download, you're basically stuck, _even_ if you have unlimited resources, it's not clear how the released weights help you; your best bet is to start almost from scratch.

What are you talking about? You just post-train it.

There is exactly zero difference before and after they stop distributing it. People don't have access to the training data now (while the models are being distributed), and they post-train very successfully.

What would you even use the training data for?

> You can post-train any LLM very easily without access to the original training data.

Are you claiming this is e.g. what Alibaba spends their time doing?

My point is that the usefulness of this is limited _in comparison to the one provided by having their training data AND mechanisms_.

  • > what Alibaba spends their time doing?

    Not most of the time (pre-training takes a long time), but post-training is where most of the value is, yes.

    Famously it is all that OpenAI did between GPT 4o and GPT 5.3 (or 5.2?) - they didn't manage to complete a pre-training run[1], and all their progress was done with post-training (!)

    Post-training is what Cursor spends their time doing, and that has produced a model that is competitive with the best coding models out there.

    It isn't limited at all.

    If you want to complain about something not being open source, complain about the lack of good open source RL environments (Prime Intellect excepted).

    [1] https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...

    • > It isn't limited at all.

      Your own message already shows that it is indeed limited, so dunno where you get "that's where most of the value is". It definitely is not, and your own link shows that the limit is there. This is not to say that there is _no_ value whatsoever, but that the value is negligible compared to what someone with _the real source_ could do. See the Rio model for another example.

      > If you want to complain about something not being open source, complain about the lack of good open source RL environments (Prime Intellect excepted).

      This was implicitly included when I emphasized "AND the software used to build the model", which I did for a reason.
