Comment by kachapopopow

1 day ago

I know for a fact that all SOTA models have Linux source code in them, intentionally or not, which means they should follow the GPL license terms and open-source the parts of the models that are derivative works of it.

Yes, this is indirectly hinting that during training the GPL-tainted code touches every single floating-point value in the model, making it a derivative work - even the tokenizer isn't immune to this.
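To make the first half of that concrete, here is a toy sketch (the model, data, and training loop are all invented for illustration; this is not how any real lab trains): train the same tiny network twice, once with an extra "GPL" document in the mix and once without, and compare the weights. Essentially every parameter ends up different, which is the sense in which one training document "touches every float".

```python
# Toy illustration (PyTorch): two runs of the same tiny model, one trained
# with an extra "GPL" document and one without. Everything here is made up;
# the point is only that the extra document shifts essentially every weight,
# not some isolated subset you could carve out afterwards.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def train(model, batches, steps=20):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        for x, y in batches:
            opt.zero_grad()
            nn.functional.cross_entropy(model(x), y).backward()
            opt.step()
    return model

base = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 4))

# Two toy "documents": pretend one is GPL-derived and one is not.
gpl_batch   = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
other_batch = (torch.randn(8, 16), torch.randint(0, 4, (8,)))

with_gpl    = train(copy.deepcopy(base), [gpl_batch, other_batch])
without_gpl = train(copy.deepcopy(base), [other_batch])

changed = sum((a != b).sum().item()
              for a, b in zip(with_gpl.parameters(), without_gpl.parameters()))
total = sum(p.numel() for p in base.parameters())
print(f"{changed}/{total} weights differ once the 'GPL' document is included")
```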

> the tokenizer isn't immune to this

A tokenizer's set of tokens isn't copyrightable in the first place, so it can't really be a derivative work of anything.
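For context, the "set of tokens" is essentially a lookup table from short strings to integer IDs. A rough sketch (the entries and the greedy matching below are invented; real BPE tokenizers work by merging byte pairs, but the output is still just IDs drawn from a fixed table):

```python
# Rough sketch of what a tokenizer vocabulary boils down to: a plain mapping
# from character sequences to integer IDs. Entries here are invented.
vocab = {
    "int": 101,
    " main": 102,
    "(": 103,
    ")": 104,
    " {": 105,
    "\n": 106,
}

def tokenize(text, vocab):
    # Greedy longest-match lookup; real BPE merges byte pairs instead, but the
    # result is still just a sequence of IDs from a fixed table like this.
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("int main( {", vocab))   # -> [101, 102, 103, 105]
```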

  • The GPL, however, does put restrictions on it, even the tokenizer. It was specifically crafted so that even if you do not have any GPL-licensed source code in your project, if it was built on top of GPL-licensed code you are still bound by the GPL's limitations.

    The only reason user space is not affected is that the kernel carves out an explicit exception for it, and only for communication through the defined interface (system calls); if you go around it, or try to put a workaround in the kernel itself, guess what: it still violates the license. The point is: it is very restrictive.

    • > The GPL, however, does put restrictions on it, even the tokenizer. It was specifically crafted so that even if you do not have any GPL-licensed source code in your project, if it was built on top of GPL-licensed code you are still bound by the GPL's limitations.

      This is not how copyright law works. The GPL is a copyright license, as stated by the FSF. Something that is not subject to copyright cannot be subject to a copyright license.


When you say “in” them, are you referring to their training data, or their model weights, or the infrastructure required to run them?

  • The GPL can be considered viral: something based on GPL-licensed code (unless explicitly excluded by the license) is itself GPL-licensed, so the 'injected' training data becomes GPL-licensed, which means the model weights created from it should, in theory, also become GPL-licensed.