Comment by ellisd
3 months ago
The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their collection of 7.5 million (350 TB) Chinese non-fiction books ... which is bigger than Library Genesis.
A previous DeepSeek paper mentioned Anna’s Archive.
> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions.
Source: DeepSeek-VL paper, https://arxiv.org/abs/2403.05525
Why do they need to grant access for people to use copies of books they don’t own?
Not to rationalize it, but it appears that they're gatekeeping the dataset to get access to the OCR-scans from the people they choose to share it with. This is to improve their existing service by making the content of books (and not just their title/tags) searchable.
As per the blog post:
> What does Anna’s Archive get out of it? Full-text search of the books for its users.
Fair enough. It just seems like they're painting an even bigger target on their backs by restricting access to copyrighted material they don't own the rights to.
> The books from Duxiu have long been pirated on the Chinese internet. Usually they are being sold for less than a dollar by resellers. They are typically distributed using the Chinese equivalent of Google Drive, which has often been hacked to allow for more storage space
Ownership laundering.
Yes, it means they will never release their dataset :(
Hahaha, I also immediately thought of this. I wonder when the OCR'd dataset will be released.
Oh great, so now Anna's Archive will get taken down as well by another trash LLM provider abusing repositories that students and researchers use. Meta torrenting 70TB from Library Genesis wasn't enough.
Seems like they are doing fine:
https://open-slum.org
Yeah, for now. Meta torrented 70TB and right after that they cut the rope for everyone else. Mysteriously, their hitman (the US government) hit both Libgen and Z-Lib shortly after.
It appears this is an active offer from Anna's Archive, so presumably they can handle the load and are able to satisfy the request safely.