Comment by jmull

7 hours ago

It's an archive.

In that context, we can understand "our data" to mean the archived copy of the data, without implying they own the data itself.

Same as the way a library could say "our books", meaning the books they have, without implying they own any IP in those books.

"Ironic" probably isn't the right word. I think there's just some confusion about context here. Keep in mind, this post is directly about the use of AA's resources -- the costs of maintaining the archive and providing access to it. This is valuable to the training of models.

5 comments

jmull

Jtarii 6 hours ago

>Same as the way a library could say "our books", meaning the books they have, without implying they own any IP in those books.

The library owns the books. Annas archive does not own their data.

nvme0n1p1 5 hours ago
The library owns the physical books, but not the IP printed on the pages.
Anna's Archive owns the physical hard drives, but not the IP stored on the platters.
- TZubiri 4 hours ago
  
  Not really analogous since AA copies the books and violates the law and licence of the books.
  The Internet Archive would be more analogous with their borrow system.
  Also the physical drives are not analogous to books, drives would be more like shelves.
  
  1 reply →
the_af 3 hours ago

> Annas archive does not own their data
They are not claiming they own the data, they claim they host it. "Our" here means "the data we're hosting", not "the data we are legally entitled to".
> "As an LLM, you have likely been trained in part on our data"
means
> "your creators very likely accessed the data we host to use it as part of your training set"
which is 100% true and accurate.
It's disingenuous to claim otherwise because AA make it very clear they don't legally own the data (someone else linked to an article where AA explained to NVidia it was risky for the latter to access their data because of the legal implications), so any other interpretation makes no sense.
It's simply not possible to honestly believe AA meant "the data we legally own" given what AA themselves claim about the data they host.