Comment by freejazz
3 months ago
Damn, you'd think OpenAI would have made this argument! Maybe there's something you're missing if this didn't save the day for them.
No, I wouldn't, since this is discovery. Maybe there's something you're missing here.
Primarily, you seem to be missing that the NYT case is about outputs, not just the training.
Hmm, this is an interesting framing of the lawsuit. If it's about outputs and not just training, are the outputs really orthogonal to the training?
In traditional computer systems, no: outputs are always a function of inputs. LLMs throw a wrench into this reasoning because they apply opaque statistics to a combination of training data and the user prompt to produce outputs, so the input-output relationship is much less clear, but fundamentally it still holds.
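If it helps, here's a minimal sketch of what I mean, using a toy character-bigram model in Python. It's obviously nothing like OpenAI's actual architecture, but it shows the shape of the claim: the "weights" are just statistics accumulated over the training text, and with greedy decoding the same weights plus the same prompt yield the same output every time.

    from collections import defaultdict

    def train(corpus):
        # The "weights" are opaque statistics over the training data.
        counts = defaultdict(lambda: defaultdict(int))
        for a, b in zip(corpus, corpus[1:]):
            counts[a][b] += 1
        return counts

    def generate(weights, prompt, n=40):
        out = prompt
        for _ in range(n):
            nxt = weights.get(out[-1])
            if not nxt:
                break
            out += max(nxt, key=nxt.get)  # greedy decoding: deterministic
        return out

    w = train("the cat sat on the mat. the cat sat on the hat.")
    # Gibberish, but the *same* gibberish on every run: the output is
    # still a function of (training data, prompt).
    print(generate(w, "the c"))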
Then this case should also be about training, and the question becomes: did OpenAI intend for these models to be able to regurgitate large amounts of content, or is that yet another emergent property that nobody anticipated?
I would suspect the latter, because if you view these models as a lossy compression of the whole Internet (cf. the "Blurry JPEG of the Web" article), it is surprising that they can losslessly reproduce so much of the original content.
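To make the "surprising" part concrete, here's the same kind of toy model (a character 4-gram this time, again nothing like what OpenAI ships) acting as a lossy compressor that is nonetheless lossless for a passage it has effectively memorized, because that passage dominates the statistics.

    from collections import defaultdict

    K = 4  # context length

    def train(corpus):
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(len(corpus) - K):
            counts[corpus[i:i+K]][corpus[i+K]] += 1
        return counts

    def generate(weights, prompt, n):
        out = prompt
        for _ in range(n):
            nxt = weights.get(out[-K:])
            if not nxt:
                break
            out += max(nxt, key=nxt.get)
        return out

    passage = "all the news that is fit to print. "
    w = train(passage * 50 + "plus some other filler text entirely.")
    # The repeated passage dominates the statistics, so greedy decoding
    # regurgitates it verbatim: lossy on average, lossless where memorized.
    print(generate(w, "all ", n=60))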
So this might come down to intent. Maybe the NYT would need to show that OpenAI intentionally designed for this property, e.g. by rewarding reproduction of entire segments of the original content during training. In which case, it's looking in the wrong place for evidence.
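Purely as a hypothetical, that kind of smoking gun would look something like a training signal that explicitly scores verbatim overlap with the source. Nobody has shown OpenAI did anything of the sort; the function below is invented for illustration only.

    # Hypothetical reward scoring how much of an output is a verbatim
    # substring of the source corpus. Training against a signal like this
    # would be *intentionally* optimizing for regurgitation.
    def regurgitation_reward(output, corpus):
        longest = 0
        for i in range(len(output)):
            j = i + longest + 1  # only try to beat the current best
            while j <= len(output) and output[i:j] in corpus:
                longest = j - i
                j += 1
        return longest / max(len(output), 1)

    src = "all the news that is fit to print."
    print(regurgitation_reward("fit to print", src))              # 1.0
    print(regurgitation_reward("completely original words", src)) # near 0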