Comment by notpushkin

4 days ago

I think there’s a couple ways to improve it:

1. There are a lot of variants of the same book, and we only need one for the index. Perhaps, for each ISBN, select the format that is easiest to parse.

2. We can download, convert and index top 100K books first, launch with these, and then continue indexing and adding other books.
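
A minimal sketch of point 1, assuming each catalogue record is a dict with hypothetical `isbn`, `format` and `md5` fields. The preference order is a guess at which formats are easiest to parse, not something established in this thread:

```python
# Assumed ranking, easiest to parse first; adjust to taste.
FORMAT_PREFERENCE = ["epub", "txt", "mobi", "pdf", "djvu"]


def format_rank(fmt):
    """Lower is better; formats not in the list sort last."""
    fmt = fmt.lower()
    if fmt in FORMAT_PREFERENCE:
        return FORMAT_PREFERENCE.index(fmt)
    return len(FORMAT_PREFERENCE)


def pick_one_per_isbn(records):
    """Keep a single record per ISBN: the one whose format ranks best."""
    best = {}
    for rec in records:
        current = best.get(rec["isbn"])
        if current is None or format_rank(rec["format"]) < format_rank(current["format"]):
            best[rec["isbn"]] = rec
    return best
```

This only collapses variants that share an ISBN; records without one would need a different key.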

How are you going to download the top 100k? The only reasonable way to download that many books from AA or Libgen is to use the torrents, which are sorted sequentially by upload date.

I tried to automate downloading just a thousand books and it was unbearably slow, both from IPFS and from the mirrors. I ended up picking the individual files out of the torrents. Even just identifying or deduping the top 100k would be a significant task.

  • For each book they store its exact location in the torrent files. You can see on the book page, e.g.:

    collection “ia” → torrent “annas-archive-ia-acsm-n.tar.torrent” → file “annas-archive-ia-acsm-n.tar” (extract) → file “notesonsynthesis0000unse.pdf”

    But you should probably get it from the database dumps they provide instead of hammering the website.

    So you come up with a list of books you want to prioritize, search the DB for torrent name and file to download, download only the files you need, and extract them. You’ll probably end up with quite a few more books, which you may index or skip for now, but it is certainly doable.
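
The lookup-then-extract step could be sketched like this, assuming the location data has already been pulled from the dumps into rows with hypothetical `book_id`, `torrent`, `archive` and `member` fields (the real dump schema is not shown here and will differ). Downloading the torrent data is left out; this only plans what to fetch and pulls the wanted members out of a tar that is already on disk.

```python
import os
import shutil
import tarfile
from collections import defaultdict


def plan_downloads(wanted_ids, locations):
    """Map torrent name -> archive name -> set of member files we want."""
    wanted = set(wanted_ids)
    plan = defaultdict(lambda: defaultdict(set))
    for row in locations:
        if row["book_id"] in wanted:
            plan[row["torrent"]][row["archive"]].add(row["member"])
    return plan


def extract_members(tar_path, members, dest_dir):
    """Extract only the named members from a tar; return the names found."""
    found = []
    with tarfile.open(tar_path) as tar:
        for info in tar:
            if info.isfile() and info.name in members:
                src = tar.extractfile(info)
                # Write under the base name only, to avoid path tricks in archive names.
                out_path = os.path.join(dest_dir, os.path.basename(info.name))
                with open(out_path, "wb") as out:
                    shutil.copyfileobj(src, out)
                found.append(info.name)
    return found
```

Grouping by torrent first means each torrent is touched once, however many of the wanted books it holds.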

The thing is, an ISBN identifies one edition by one publisher, and one can easily have the same text under three different ISBNs from a single publisher (hardcover, trade paperback, mass-market paperback).

I count 80+ editions of J.R.R. Tolkien's _The Hobbit_ at:

https://tolkienlibrary.com/booksbytolkien/hobbit/editions.ph...

Granted, some predate ISBNs, one is the 3D pop-up version (so not a traditional text), and so forth, but filtering by ISBN will _not_ filter out duplicates.

There is also the problem of the same work being published under multiple titles (and also ISBNs): Hal Clement's _Small Changes_ was re-published as _Space Lash_, and that short story collection is now collected in:

https://www.goodreads.com/book/show/939760.Music_of_Many_Sph...

along with others.

There should be a way to leverage compression when storing multiple editions of the same book.
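
One way to try this, as a toy sketch rather than a storage design: keep one edition in full and compress the others against it, using it as a zlib preset dictionary. The function names are made up for illustration. zlib only looks back 32 KiB, so this helps for short texts or chunks; whole books would need a compressor with a larger window or a proper delta scheme.

```python
import zlib


def compress_against(reference: bytes, edition: bytes) -> bytes:
    """Compress `edition` using `reference` as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(edition) + comp.flush()


def decompress_against(reference: bytes, blob: bytes) -> bytes:
    """Inverse of compress_against; needs the exact same reference bytes."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(blob) + decomp.flush()
```

The catch is that every stored edition then depends on the reference copy staying available and byte-identical.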

  • From a search-quality perspective, though, you probably don't want 500 different versions of the same book popping up for a query.

    • Agreed. I would prefer to see a single result for a single title. The option of pursuing different editions should follow from there.
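
Collapsing hits to one result per title, with the editions hanging off it, could look roughly like this. It assumes each hit is a dict with hypothetical `title` and `author` fields, and the grouping key is a crude normalisation of those two. It would not catch retitled works such as _Small Changes_ / _Space Lash_; that needs a real work-level identifier.

```python
import re


def _norm(text):
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def work_key(title, author):
    """Crude grouping key; also drops a leading article from the title."""
    norm_title = re.sub(r"^(the|a|an) ", "", _norm(title))
    return (norm_title, _norm(author))


def collapse_hits(hits):
    """Return one entry per work, in first-seen order, with its editions attached."""
    groups = {}
    for hit in hits:
        groups.setdefault(work_key(hit["title"], hit["author"]), []).append(hit)
    return [
        {"title": eds[0]["title"], "author": eds[0]["author"], "editions": eds}
        for eds in groups.values()
    ]
```

The first hit in each group supplies the display title, so ranking the hits before collapsing decides which edition represents the work.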