← Back to context

Comment by ronsor

1 day ago

> GPL however, does put restrictions on it, even the tokenizer. It was specifically crafted in a way where even if you do not have any GPL licensed sourcecode in your project, but it was built on top of it you are still binded by GPL limitations.

This is not how copyright law works. The GPL is a copyright license, as stated by the FSF. Something which is not subject to copyright cannot be subject to a copyright license.

GPL is not only a copyright license, it also covers multiple types of intellectual property rights. Especially when you consider GPL-3 which has explicit IP protection while GPL-2 is implicit, so yah you're partially right for GPL-2 and wrong for GPL-3.

  • It's true that GPLv3 covers patents, but it is still primarily a copyright license.

    The tokenizer's tokens aren't patented, for sure. They can't be trademarked (they don't identify a product or service). They aren't a trade secret (the data is public). They aren't copyrighted (not a creative work). And the GPL explicitly preserves fair use rights, so there are no contractual restrictions either.

    A tokenizer is effectively a list of the top-n most common byte sequences. There's simply no basis in law for it to be subject to copyright or any other IP law in the average situation.

    • I mean okay sure, there is no legal framework for tokenizers, but what about the rest of the model I think there is a much stronger argument there? And you could realistically extend the logic that if the model is GPL-2.0 licensed you have to provide all the tools to replicate it which would include the tokenizer.