Comment by VorpalWay

3 days ago

Aren't they only open weights, not true open source?

The concept of open source doesn't really apply to AI models since their behavior is mostly controlled by the data they were trained on and the complex ways they are trained. Having the source code of the model by itself wouldn't help you.

From a practical POV, having all the training data, training infrastructure, and training know-how wouldn't help you either unless you could afford the millions of dollars (hundreds of millions for a SOTA model) of compute needed to retrain the model each time a new training set was released, and that limits you to the big commercial companies. "Open source for the people" just does not apply.

  • If (and that is a big if) the concept of open source doesn't apply, then the term shouldn't be coopted to mean something else though.

    But even if I can't build it from source locally, being able to see what went into the model is an important part of what open source is about.

    • > If (and that is a big if) the concept of open source doesn't apply, then the term shouldn't be coopted to mean something else though.

      Yes, but for whatever reason this usage seems to have stuck. Open weights is definitely a better name. I assume "open source" has stuck because you can download and use the models for free, but "open source" was always intended to be about "free as in speech", not "free as in beer". That said, I remember when the term "open source" was invented, and it was always a bit different from, and more commercially aligned than, the goals of the FSF.

      > But even if I can't build it from source locally, being able to see what went into the model is an important part of what open source is about.

      True. Unfortunately LLMs have become such a big-money, closed enterprise (the opposite of OpenAI and Anthropic's altruistic founding principles) that it's hard to see the companies behind these commercial models releasing their training data, especially since this data is the closest thing they have to a moat other than the cost of training.

      The most valuable training data right now seems to be "reasoning data", and the need for this, at least, may disappear as AI moves beyond pre-trained language models to smarter systems that are capable of learning for themselves and can actually reason rather than needing to parrot reasoning data.

  • Publishing RL/SFT/self-distillation harnesses would be very impactful even without the data.

    Tool use in particular can be taught with self-distillation without any data. Have a tool the model doesn't know? A teacher model RTFMs, reads the source code, and helps the student learn to get it right.