Comment by bhaney

4 days ago

Honestly I don't think it would be that costly, but it would take a pretty long time to put together. I have a (few years old) copy of Library Genesis converted to plaintext and it's around 1TB. I think libgen proper was 50-100TB at the time, so we can probably assume that AA (~1PB) would be around 10-20TB when converted to plaintext. You'd probably spend several weeks torrenting a chunk of the archive, converting everything in it to plaintext, deleting the originals, then repeating with a new chunk until you have plaintext versions of everything in the archive. Then indexing all that for full text search would take even more storage and even more time, but still perfectly doable on commodity hardware.

The main barriers are going to be reliably extracting plaintext from the myriad of formats in the archive, cleaning up the data, and selecting a decent full text search database (god help you if you pick wrong and decide you want to switch and re-index everything later).
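
A sketch of what I mean by the convert-and-delete loop, assuming poppler-utils (pdftotext), djvulibre (djvutxt) and Calibre (ebook-convert) are installed; anything those tools don't cover gets left for a later pass:

```python
import subprocess
from pathlib import Path

# Map each file extension to the command that converts it to plaintext.
CONVERTERS = {
    ".pdf":  lambda src, dst: ["pdftotext", str(src), str(dst)],
    ".djvu": lambda src, dst: ["djvutxt", str(src), str(dst)],
    ".epub": lambda src, dst: ["ebook-convert", str(src), str(dst)],  # Calibre, slow per file
    ".mobi": lambda src, dst: ["ebook-convert", str(src), str(dst)],
}

def convert_chunk(chunk_dir: Path, out_dir: Path) -> None:
    """Convert every recognised file in a downloaded chunk to plaintext,
    then delete the original so the next torrent chunk fits on disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in chunk_dir.rglob("*"):
        if not src.is_file():
            continue
        make_cmd = CONVERTERS.get(src.suffix.lower())
        if make_cmd is None:
            continue  # unrecognised format: deal with it in a later pass
        dst = out_dir / (src.stem + ".txt")
        result = subprocess.run(make_cmd(src, dst), capture_output=True)
        if result.returncode == 0:
            src.unlink()  # reclaim space before the next chunk
        else:
            print(f"conversion failed, keeping original: {src}")
```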

The main barriers for me would be:

1. Why? Who would use that? What’s the problem with the other search engines? How will it be paid for?

2. Potential legal issues.

The technical barriers are at least challenging and interesting.

Providing a service with significant upfront investment, no product or service vision, and a good chance of being sued a couple of times a year, probably losing, with who knows what kind of punishment… I'll have to pass, unfortunately.

  • 1. It'd be for the scientific community (broadly-construed). Converting media that is currently completely un-indexed into plaintext and offering a suite of search features for finding content within it would be a game-changer, IMO! If you've ever done a lit review for any field other than ML, I'm guessing you know how reliant many fields are on relatively-old books and articles (read: PDFs at best, paper-only at worst) that you can basically only encounter via a) citation chains, b) following an author, or c) encyclopedias/textbooks.

    2. I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it. GoodReads doesn't need legal permission to index popular books, for example.

    In general I get the sense that your comment is written from the perspective of an entrepreneur/startup mindset. I'm sure that's brought you meaning and maybe even some wealth, but it's not a universal one! Some of us are more interested in making something to advance humanity than something likely to make a profit, even if we might look silly in the process.

    • > I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it.

      You don't need to host copyrighted material. It's all about intent. The Pirate Bay is (imo correctly, even if I disagree with other aspects of copyright law and its enforcement) seen as a place where people go to find ways to not pay authors for their content. They never hosted a copyrighted byte, yet they're blocked in some form (DNS, IP, domain seizures) in many countries. So are TPB proxies, so merely being like an ISP for such a site is already enough. Meanwhile, nobody is ordering blocks of Comcast's IP addresses for providing access to websites with copyrighted material, because Comcast has no somewhat-provable intent to facilitate copyright infringement.

      When I read the OP, I imagine this would link from the search results directly to Anna's Archive and Sci-Hub, but I think you'd have to spin it as a general-purpose search page and ideally not even mention that AA was one of the sources, much less link to it.

      (Don't get me wrong: everyone wants this except the lobby of journals that presently own the rights)

      It would be a real shame if an anonymous third party that's definitely not the website operator made a Firefox add-on that illegitimately inserts those links into the search results page, though.

    • Yeah, but how does the search work? Does it show a portion of the text? If it shows a portion of the text, isn't that also a part of the book?

  • But he did not mention anything about creating a "service"

    It could be his own copy for personal use

    What if computers continue to become faster and storage continues to become cheaper; what if "large" amounts of data continue to become more manageable?

    The data might seem large today, but it might not seem large or unmanageable in the future

  • It would be incredible for LLMs. Searching it, using it as training data, etc. Would probably have to be done in Russia or some other country that doesn't respect international copyright though.

    • Do you have a reason to believe this ain't already being done? I would assume that the big guys like openai are already training on basically all text in existence.

    • > Would probably have to be done in Russia or some other country that doesn't respect international copyright though.

      Incredible: several years of major American AI companies showing that flouting copyright only matters if it's college kids torrenting shows or enthusiasts archiving bootlegs on What.CD, but if it's big corpos doing it, it's necessary for innovation.

      Yet some people still believe "it would have to be done in evil Russia".

  • > 1. Why? Who would use that?

    Rather, who would use a traditional search engine instead of a book search engine, when the quality of the results from the latter would be much superior?

    People who need or want the highest quality information available will pay for it. I'd easily pay for it.

I think there are a couple of ways to improve it:

1. There are a lot of variants of the same book. We only need one for the index. Perhaps for each ISBN, select the format that's easiest to parse.

2. We can download, convert and index top 100K books first, launch with these, and then continue indexing and adding other books.

  • How are you going to download the top 100k? The only reasonable way to download that many books from AA or Libgen is to use the torrents, which are sorted sequentially by upload date.

    I tried to automate downloading just a thousand books and it was unbearably slow, from both IPFS and the mirrors. I ended up picking the individual files out of the torrents. Even just identifying or deduping the top 100k would be a significant task.

    • For each book they store its exact location in the torrent files. You can see it on the book page, e.g.:

      collection “ia” → torrent “annas-archive-ia-acsm-n.tar.torrent” → file “annas-archive-ia-acsm-n.tar” (extract) → file “notesonsynthesis0000unse.pdf”

      But probably you should get it from the database dumps they provide instead of hammering the website.

      So you come up with a list of books you want to prioritize, search the DB for torrent name and file to download, download only the files you need, and extract them. You’ll probably end up with quite a few more books, which you may index or skip for now, but it is certainly doable.
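
      A sketch of that last step, assuming you've already loaded the relevant metadata into a local table of your own (the priority_books table and its columns are invented for illustration, not AA's actual schema) and that your torrent client has fetched the .tar you need; most clients let you select individual files within a torrent:

      ```python
      import sqlite3
      import tarfile
      from pathlib import Path

      def wanted_files(db_path: str, torrent_name: str) -> list[str]:
          """Paths inside the .tar for the prioritised books of one torrent.
          'priority_books' is a hypothetical table built from the metadata dumps."""
          con = sqlite3.connect(db_path)
          rows = con.execute(
              "SELECT path_in_tar FROM priority_books WHERE torrent = ?",
              (torrent_name,),
          ).fetchall()
          con.close()
          return [r[0] for r in rows]

      def extract_members(tar_path: Path, members: list[str], out_dir: Path) -> None:
          """Pull only the requested members out of an already-downloaded .tar."""
          out_dir.mkdir(parents=True, exist_ok=True)
          with tarfile.open(tar_path) as tar:
              for name in members:
                  tar.extract(name, path=out_dir)
      ```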

  • The thing is, an ISBN identifies one edition from one publisher, and you can easily have the same text under three different ISBNs from a single publisher (hardcover, trade paperback, mass-market paperback).

    I count 80+ editions of J.R.R. Tolkien's _The Hobbit_ at:

    https://tolkienlibrary.com/booksbytolkien/hobbit/editions.ph...

    granted, some predate ISBNs, one is the 3D pop-up version (so not a traditional text), and so forth, but filtering by ISBN will _not_ filter out duplicates.

    There is also the problem of the same work being published under multiple titles (and also ISBNs) --- Hal Clement's _Small Changes_ was re-published as _Space Lash_ and that short story collection is now collected in:

    https://www.goodreads.com/book/show/939760.Music_of_Many_Sph...

    along with others.
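
    If you do still want automatic dedup, one crude fallback is a normalised (title, author) key instead of ISBN. A sketch, not a solution: it will happily merge editions you may want to keep separate, and it still misses retitled reprints like the _Small Changes_ / _Space Lash_ case:

    ```python
    import re
    import unicodedata

    def dedup_key(title: str, author: str) -> tuple[str, str]:
        """Collapse spelling/punctuation variants of the same title and author."""
        def norm(s: str) -> str:
            s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
            s = re.sub(r"[^a-z0-9 ]", " ", s.lower())   # strip punctuation, lowercase
            s = re.sub(r"\b(the|a|an)\b", " ", s)        # drop articles
            return " ".join(s.split())
        return norm(title), norm(author)

    # dedup_key("The Hobbit", "J.R.R. Tolkien") == dedup_key("Hobbit, The", "J. R. R. Tolkien")
    ```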

I wonder if you could implement it with only static hosting?

We would need to split the index into a lot of smaller files that can be practically downloaded by browsers, maybe 20 MB each. The user types in a search query, the browser hashes the query and downloads the corresponding index file which contains only results for that hashed query. Then the browser sifts quickly through that file and gives you the result.
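
A build-side sketch of that sharding scheme (the shard count, the JSON format, and the hash choice are all placeholders; the browser would just fetch shards/<hash(term) % N>.json and look the term up locally):

```python
import hashlib
import json
from collections import defaultdict
from pathlib import Path

N_SHARDS = 4096  # tune so each shard file lands near the ~20 MB target

def shard_of(term: str) -> int:
    """Stable mapping from a search term to one of N_SHARDS shard files."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_SHARDS

def write_shards(postings: dict[str, list[int]], out_dir: Path) -> None:
    """postings maps each term to the list of document ids containing it."""
    shards: dict[int, dict[str, list[int]]] = defaultdict(dict)
    for term, doc_ids in postings.items():
        shards[shard_of(term)][term] = doc_ids
    out_dir.mkdir(parents=True, exist_ok=True)
    for shard_id, terms in shards.items():
        (out_dir / f"{shard_id}.json").write_text(json.dumps(terms))
```

The browser would run the same hash over the query term, fetch the one matching shard file, and search within it.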

Hosting this would be cheap, but the main barriers remain.

  • I've done something similar with a static-hosted site I'm working on. I opted not to reinvent the wheel and just use WASM SQLite in the browser. SQLite already splits the database into fixed-size pages, so a driver using HTTP Range Requests can download only the required pages. You just have to build good indexes.

    I can even use SQLite's full-text search capabilities!
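
    The build side is just an ordinary SQLite file with an FTS5 table (assuming your local sqlite3 has FTS5 compiled in, which most builds do); the WASM driver then reads it page by page over range requests:

    ```python
    import sqlite3

    con = sqlite3.connect("library.db")
    con.execute("PRAGMA page_size = 4096")  # set before creating any tables
    con.execute("CREATE VIRTUAL TABLE books USING fts5(title, author, body)")
    con.execute(
        "INSERT INTO books VALUES (?, ?, ?)",
        ("Example Title", "Example Author", "...the extracted plaintext..."),
    )
    con.commit()

    # Same full-text query syntax the browser-side driver would use:
    for (title,) in con.execute("SELECT title FROM books WHERE books MATCH 'plaintext'"):
        print(title)
    ```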

It's trivial to normalise the various formats, and there were a few libraries and ML models to help parse PDFs. I was tinkering around with something like this for academic papers in Zotero, and the main issue I ran into was words spilling over to the next page, and footnotes. I totally gave up on that endeavour several years ago, but the tooling has probably matured exponentially since then.
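
The kind of cleanup I mean, for anyone tempted to try again: rejoin words hyphenated across line or page breaks and collapse hard-wrapped lines back into paragraphs. Purely heuristic, and it does nothing about footnotes or running headers:

```python
import re

def clean_extracted_text(text: str) -> str:
    """Cheap post-processing for text pulled out of PDFs."""
    # "experi-\nment" -> "experiment": rejoin words split across line breaks
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # single newlines inside a paragraph become spaces; blank lines stay as breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return re.sub(r"[ \t]+", " ", text)
```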

As an example, all the academic paper hubs have been using this technology for decades.

I'd wager that all of the big gen-AI companies have planned to use this exact dataset, and many of them probably already have.

  • > It's trivial to normalise the various formats,

    Ha. Ha. ha ha ha.

    As someone who has pretty broadly tried to normalize a pile of books and documents I have legitimate access to: no, it is not.

    You can get good results 80% of the time, usable but messy results 18% of the time, and complete garbage the remaining 2%. More effort seems to only result in marginal improvements.

Decent storage is $10/TB, so for $10,000 you could just keep the entire 1PB of data.

A rather obvious question is if someone has trained an LLM on this archive yet.

  • A rather obvious answer: Meta is currently being sued for training Llama on Anna's Archive.

    You can be practically certain that every notable LLM has been trained on it.

    • > You can be practically certain that every notable LLM has been trained on it.

      But only Meta was dumb enough to publicly admit it.