Comment by terminalshort

3 months ago

No, I wouldn't, since this is discovery. Maybe there's something you're missing here.

Primarily, you seem to be missing the fact that the NYT case is about outputs, not just training.

  • Hmm, this is an interesting framing of the lawsuit. If it's about outputs and not just training, are the outputs really orthogonal to the training?

    In traditional computer systems, no, outputs are always a function of inputs. LLMs throw a wrench into this reasoning because they apply opaque statistics to a combination of training data and the user prompt to produce outputs, so the input-output relationship is much less clear, but fundamentally it still holds.
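
    A toy sketch to make that concrete (a hypothetical toy_generate, not any real model's API): once you count the sampling seed as an input, even a "stochastic" generator is still a pure function of its training-derived weights and the prompt.

    ```python
    import random

    # Toy generator: the "weights" stand in for whatever training baked in.
    # All randomness comes from an explicit seed, so identical inputs
    # (weights, prompt, seed) always yield identical outputs.
    def toy_generate(weights, prompt, seed, n_tokens=5):
        rng = random.Random(seed)
        tokens = prompt.split()
        for _ in range(n_tokens):
            candidates = weights.get(tokens[-1], ["<unk>"])
            tokens.append(rng.choice(candidates))
        return " ".join(tokens)

    weights = {"the": ["cat", "dog"], "cat": ["sat"], "dog": ["ran"],
               "sat": ["down"], "ran": ["away"]}
    # Same inputs -> same output, every time.
    assert toy_generate(weights, "the", seed=42) == toy_generate(weights, "the", seed=42)
    ```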

    So then this case should also be about training. The question then is: did OpenAI intend to have these models be able to regurgitate large amounts of content? Or is it yet another emergent property that nobody anticipated?

    I would suspect the latter, because if you view these models as a lossy compression of the whole Internet (cf "Blurry JPEG of the Web" article) it is a surprising outcome that they are able to losslessly reproduce so much of the original content.
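
    As a hedged illustration of how that could be emergent rather than designed (a toy bigram counter, nowhere near a real transformer): nothing in a plain next-word objective says "memorize", yet greedy decoding spits back verbatim runs of the training text, because on a small corpus the most likely continuation of each word is simply the original continuation.

    ```python
    from collections import Counter, defaultdict

    corpus = "the times reported that the company trained its model on the times archive"

    # "Training": count next-word frequencies, the crudest next-token objective.
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

    def greedy_continue(prompt, n=10):
        out = prompt.split()
        for _ in range(n):
            if out[-1] not in counts:
                break
            out.append(counts[out[-1]].most_common(1)[0][0])  # argmax next word
        return " ".join(out)

    # Regurgitates verbatim runs of the corpus, e.g. "the times reported that ..."
    print(greedy_continue("the times"))
    ```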

    So this might come down to intent. Maybe the NYT would need to show that OpenAI intentionally designed for this property, e.g. by rewarding reproductions of entire segments of the original content in its training. In which case, it's looking in the wrong place for evidence.

    • >Hmm, this is an interesting framing of the lawsuit.

      First, it's not a "framing" of the lawsuit. A lawsuit is a number of claims made by one party against the other. In the two California cases, there were no decisions made on claims relating to LLM outputs. In the NYT case, there are claims relating to LLM outputs.

      Yes, it could also be about training. But the discovery at issue pertains to the outputs, which are what this case is about. So even if you apply the holding that training is fair use, which I don't see as likely to happen in the district courts of the Second Circuit, you still don't get the result the person I responded to suggested: that this should all be moot because of two decisions in two different California cases, which are not binding precedent in the Second Circuit and which would not dispose of all of the NYT's claims anyway.

      >So then this case should also be about training. The question then is: did OpenAI intend to have these models be able to regurgitate large amounts of content? Or is it yet another emergent property that nobody anticipated?

      Intent is not a required element of copyright infringement, so you'd be wrong there. Plaintiffs can use intent as evidence of willful infringement, which entitles them to a damages multiplier in statutory damages cases, and this is one. So OpenAI can't avoid liability based on its intent or lack thereof; at best, it can use intent to establish that the NYT is not entitled to heightened damages.

      >So this might come down to intent.

      It's always amusing to see people apply completely made-up rationales to legal cases based on their own personal feelings about technologies while disregarding, let's say, 100 years of legal jurisprudence.
