Comment by teiferer
11 days ago
Can we please stop calling those models "open source"? Yes, the weights are open. So "open weight," maybe. But the source isn't open: the thing that would allow you to re-create it. That's what "open source" used to mean. (Together with a license that allows you to use that source for various things.)
No major AI lab will admit to training on proprietary or copyrighted data, so what you're asking for is an impossibility. You can make a pretty good LLM if you train on Anna's Archive, but it will either be released anonymously or with a research-only, non-commercial license.
There isn't enough public-domain data to create good LLMs, especially once you get into the newer benchmarks that expect PhD-level domain expertise in various niche verticals.
It's also a logical impossibility to create a zero-knowledge proof that attributes a model to specific training data without admitting that the data was used.
I can think of a few technical options but none would hold water legally.
You can use a Σ-protocol OR-composition to prove that the model was trained either on a copyrighted dataset or on a non-copyrighted dataset, without admitting which one (technically interesting, legally unsound; a sketch follows at the end of this comment).
You can prove that a model trained on copyrighted data is statistically indistinguishable from one trained on non-copyrighted data (an information-theoretic impossibility unless there exists as much public-domain data as copyrighted data, with a similar distribution).
You can argue that a public-domain dataset and a copyrighted dataset are equivalent if the models they produce have indistinguishable performance (see the toy test at the end of this comment).
All the proofs fail irl, even ignoring the legal implications, because there's far less public-domain data, so given the lemma that more training data == better model performance, all of the above are close to impossible.
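To make the first option concrete, here's a minimal sketch of a non-interactive Schnorr-style OR-composition (Fiat-Shamir transform). The group parameters are toy values, the two public values just stand in for commitments to the two candidate datasets, and the function names are mine, not any real library's API:

    import hashlib
    import secrets

    # Toy group: p = 2q + 1 with q prime; g generates the order-q subgroup.
    # (A real deployment would use a standard elliptic-curve group instead.)
    p, q, g = 2039, 1019, 4

    def H(*vals):
        """Fiat-Shamir challenge: hash the transcript down to Z_q."""
        data = "|".join(str(v) for v in vals).encode()
        return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

    def or_prove(Y, x, b):
        """Prove knowledge of x with Y[b] = g^x, without revealing which branch b is real.

        Y -- pair of public values (one per candidate dataset commitment)
        x -- discrete log of Y[b]
        b -- the branch the prover can actually open (0 or 1)
        """
        # Simulate the branch we cannot open: pick its challenge and response first.
        e_sim = secrets.randbelow(q)
        z_sim = secrets.randbelow(q)
        t_sim = (pow(g, z_sim, p) * pow(Y[1 - b], (-e_sim) % q, p)) % p

        # Run the real branch honestly.
        r = secrets.randbelow(q)
        t_real = pow(g, r, p)

        t = [0, 0]; t[b], t[1 - b] = t_real, t_sim
        e = H(Y[0], Y[1], t[0], t[1])      # overall challenge
        e_real = (e - e_sim) % q           # forces e_0 + e_1 == e, so only one branch is free
        z_real = (r + e_real * x) % q

        es = [0, 0]; es[b], es[1 - b] = e_real, e_sim
        zs = [0, 0]; zs[b], zs[1 - b] = z_real, z_sim
        return t, es, zs

    def or_verify(Y, t, es, zs):
        # Challenges must split the hash, and both branches must verify as Schnorr proofs.
        if (es[0] + es[1]) % q != H(Y[0], Y[1], t[0], t[1]):
            return False
        return all(
            pow(g, zs[i], p) == (t[i] * pow(Y[i], es[i], p)) % p
            for i in (0, 1)
        )

    # "Commitments" to two candidate training datasets; the prover can only open
    # the first one, but the proof does not reveal which branch was real.
    x0 = secrets.randbelow(q)
    Y = [pow(g, x0, p), pow(g, secrets.randbelow(q), p)]
    proof = or_prove(Y, x0, 0)
    print(or_verify(Y, *proof))  # True; the verifier learns nothing about the branch

The OR trick is exactly why it's legally useless: the proof is convincing precisely because it never pins down which dataset was used.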
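For the second and third options, "statistically indistinguishable" cashes out as something like: no two-sample test on held-out benchmark scores can tell the two models apart. A toy permutation test (the per-benchmark scores below are made up) shows how quickly a data-volume gap surfaces:

    import random
    import statistics

    def permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
        """Approximate p-value for the null hypothesis that both models'
        per-benchmark scores come from the same distribution."""
        rng = random.Random(seed)
        observed = abs(statistics.mean(scores_a) - statistics.mean(scores_b))
        pooled = list(scores_a) + list(scores_b)
        n_a = len(scores_a)
        hits = 0
        for _ in range(n_resamples):
            rng.shuffle(pooled)
            diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
            if diff >= observed:
                hits += 1
        return (hits + 1) / (n_resamples + 1)

    # Hypothetical accuracies: a model trained on the larger (copyrighted-inclusive)
    # corpus vs. one trained on public-domain data only.
    model_full = [0.71, 0.68, 0.74, 0.66, 0.72, 0.69, 0.73, 0.70]
    model_pd   = [0.58, 0.61, 0.55, 0.60, 0.57, 0.62, 0.59, 0.56]
    print(permutation_test(model_full, model_pd))  # tiny p-value: trivially distinguishable

With far less public-domain data on one side, the performance gap dominates and the test separates the models immediately, which is the practical version of the point above.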