Ask HN: Has anybody built search on top of Anna's Archive?
3 days ago
Wouldn't this basically give us Google Books and searchable Scihub at the same time?
What would it cost?
3 days ago
> Wouldn't this basically give us Google Books and searchable Scihub at the same time?
> What would it cost?
Honestly I don't think it would be that costly, but it would take a pretty long time to put together. I have a (few years old) copy of Library Genesis converted to plaintext and it's around 1TB. I think libgen proper was 50-100TB at the time, so we can probably assume that AA (~1PB) would be around 10-20TB when converted to plaintext. You'd probably spend several weeks torrenting a chunk of the archive, converting everything in it to plaintext, deleting the originals, then repeating with a new chunk until you have plaintext versions of everything in the archive. Then indexing all that for full text search would take even more storage and even more time, but still perfectly doable on commodity hardware.
The main barriers are going to be reliably extracting plaintext from the myriad of formats in the archive, cleaning up the data, and selecting a decent full text search database (god help you if you pick wrong and decide you want to switch and re-index everything later).
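For a sense of scale, the per-file conversion step is conceptually simple even though the edge cases are where all the pain lives. A rough sketch in Python (assuming pypdf, ebooklib and beautifulsoup4 are installed; DjVu, CBZ, etc. would each need their own handler, and scanned-image PDFs would need OCR on top of this):

    import sys
    from pathlib import Path

    def pdf_to_text(path: Path) -> str:
        # Only works for PDFs with an embedded text layer; scans need OCR instead.
        from pypdf import PdfReader
        reader = PdfReader(str(path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    def epub_to_text(path: Path) -> str:
        import ebooklib
        from ebooklib import epub
        from bs4 import BeautifulSoup
        book = epub.read_epub(str(path))
        return "\n".join(
            BeautifulSoup(item.get_content(), "html.parser").get_text()
            for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT)
        )

    HANDLERS = {".pdf": pdf_to_text, ".epub": epub_to_text}

    for path in Path(sys.argv[1]).rglob("*"):
        handler = HANDLERS.get(path.suffix.lower())
        if handler is None:
            continue  # unknown format: log it and deal with it in a later pass
        try:
            path.with_suffix(".txt").write_text(handler(path), errors="replace")
        except Exception as exc:
            print(f"failed: {path}: {exc}", file=sys.stderr)

The loop itself is the easy part; the messy part is everything this skips over.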
The main barriers for me would be:
1. Why? Who would use that? What’s the problem with the other search engines? How will it be paid for?
2. Potential legal issues.
The technical barriers are at least challenging and interesting.
Providing a service with significant upfront investment needs and no product or service vision, one that I'd likely be sued over a couple of times a year, probably losing, with who knows what kind of punishment… I'll have to pass, unfortunately.
1. It'd be for the scientific community (broadly-construed). Converting media that is currently completely un-indexed into plaintext and offering a suite of search features for finding content within it would be a game-changer, IMO! If you've ever done a lit review for any field other than ML, I'm guessing you know how reliant many fields are on relatively-old books and articles (read: PDFs at best, paper-only at worst) that you can basically only encounter via a) citation chains, b) following an author, or c) encyclopedias/textbooks.
2. I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it. GoodReads doesn't need legal permission to index popular books, for example.
In general I get the sense that your comment is written from the perspective of an entrepreneur/startup mindset. I'm sure that's brought you meaning and maybe even some wealth, but it's not a universal one! Some of us are more interested in making something to advance humanity than something likely to make a profit, even if we might look silly in the process.
4 replies →
But he did not mention anything about creating a "service"
It could be his own copy for personal use
What if computers continue to become faster and storage continues to become cheaper; what if "large" amounts of data continue to become more manageable
The data might seem large today, but it might not seem large or unmanageable in the future
It would be incredible for LLMs. Searching it, using it as training data, etc. Would probably have to be done in Russia or some other country that doesn't respect international copyright though.
35 replies →
> 1. Why? Who would use that?
Rather, who would use a traditional search engine instead of a book search engine, when the quality of the results from the latter would be much superior?
People who need or want the highest quality information available will pay for it. I'd easily pay for it.
I think there are a couple of ways to improve it:
1. There are a lot of variants of the same book, and we only need one for the index. Perhaps for each ISBN, select the format that's easiest to parse (rough sketch below).
2. We can download, convert and index top 100K books first, launch with these, and then continue indexing and adding other books.
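For (1), the dedup pass could be as simple as grouping a metadata dump by ISBN and keeping the friendliest format (a sketch; the record fields and format ranking here are made up):

    # `records` is assumed to be an iterable of dicts from a metadata dump, e.g.
    # {"isbn13": "9780261102217", "extension": "epub", "md5": "..."}
    PREFERENCE = {"txt": 0, "epub": 1, "fb2": 2, "mobi": 3, "pdf": 4, "djvu": 5}

    def pick_best_per_isbn(records):
        best = {}
        for rec in records:
            isbn = rec.get("isbn13")
            if not isbn:
                continue  # no ISBN: needs title/author matching instead
            rank = PREFERENCE.get(rec.get("extension", "").lower(), 99)
            if isbn not in best or rank < best[isbn][0]:
                best[isbn] = (rank, rec)
        return [rec for _, rec in best.values()]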
How are you going to download the top 100k? The only reasonable way to download that many books from AA or Libgen is to use the torrents, which are sorted sequentially by upload date.
I tried to automate downloading just a thousand books and it was unbearably slow, from IPFS or the mirrors both. I ended up picking the individual files out of the torrents. Even just identifying or deduping the top 100k would be a significant task.
1 reply →
The thing is, an ISBN identifies one edition by one publisher, and one can easily have the same text under 3 different ISBNs from a single publisher (hardcover, trade paperback, mass-market paperback).
I count 80+ editions of J.R.R. Tolkien's _The Hobbit_ at:
https://tolkienlibrary.com/booksbytolkien/hobbit/editions.ph...
granted some predate ISBNs, one is the 3D pop-up version, so not a traditional text, and so forth, but filtering by ISBN will _not_ filter out duplicates.
There is also the problem of the same work being published under multiple titles (and also ISBNs) --- Hal Clement's _Small Changes_ was re-published as _Space Lash_ and that short story collection is now collected in:
https://www.goodreads.com/book/show/939760.Music_of_Many_Sph...
along with others.
2 replies →
There should be a way to leverage compression when storing multiple editions of the same book.
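One rough sketch of that: train a shared zstd dictionary per work and compress each edition against it, so the text the editions have in common mostly lives in the dictionary (assumes the python zstandard package; you'd store the dictionary alongside the blobs, and you'd really want to group by work rather than by ISBN):

    import zstandard as zstd

    def compress_editions(editions: list[bytes]) -> list[bytes]:
        # Train one dictionary from all editions of the same work, then
        # compress each edition against it; near-identical texts shrink a lot.
        dictionary = zstd.train_dictionary(1_000_000, editions)
        cctx = zstd.ZstdCompressor(dict_data=dictionary, level=19)
        return [cctx.compress(text) for text in editions]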
5 replies →
[dead]
I wonder if you could implement it with only static hosting?
We would need to split the index into a lot of smaller files that can be practically downloaded by browsers, maybe 20 MB each. The user types in a search query, the browser hashes the query and downloads the corresponding index file which contains only results for that hashed query. Then the browser sifts quickly through that file and gives you the result.
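On the build side, the sharding could be as dumb as hashing each term to one of N files (a sketch; the shard count and the JSON posting format are just placeholders):

    import hashlib, json
    from collections import defaultdict

    NUM_SHARDS = 4096  # tune so each shard stays around ~20 MB

    def shard_of(term: str) -> int:
        return int(hashlib.sha1(term.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

    def build_shards(docs):
        # docs: iterable of (doc_id, text); a posting list maps term -> doc ids
        shards = [defaultdict(list) for _ in range(NUM_SHARDS)]
        for doc_id, text in docs:
            for term in set(text.lower().split()):
                shards[shard_of(term)][term].append(doc_id)
        for i, postings in enumerate(shards):
            with open(f"shard-{i:04d}.json", "w") as f:
                json.dump(postings, f)

The browser then hashes the query terms the same way, fetches only the matching shard-XXXX.json files, and intersects the posting lists client-side.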
Hosting this would be cheap, but the main barriers remain..
I've done something similar with a static hosted site I'm working on. I opted to not reinvent the wheel, and just use WASM Sqlite in the browser. Sqlite already splits the database into fixed-size pages, so the driver using HTTP Range Requests can download only the required pages. Just have to make good indexes.
I can even use Sqlite's full-text search capabilities!
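For anyone curious, the build side of that is just stock SQLite FTS5; a minimal sketch (table and column names are mine, and a real build would use an external-content table to save space):

    import sqlite3

    con = sqlite3.connect("library.db")
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS books USING fts5(title, author, body)")
    con.execute("INSERT INTO books VALUES (?, ?, ?)",
                ("The Hobbit", "J.R.R. Tolkien", "In a hole in the ground there lived a hobbit."))
    con.commit()

    # The same MATCH query works from the WASM build over HTTP range requests.
    for row in con.execute("SELECT title, snippet(books, 2, '[', ']', '…', 10) "
                           "FROM books WHERE books MATCH ? ORDER BY rank", ("hole",)):
        print(row)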
4 replies →
It's trivial to normalise the various formats, and there were a few libraries and ML models to help parse PDFs. I was tinkering around with something like this for academic papers in Zotero, and the main issue I ran into was words spilling over to the next page, and footnotes. I totally gave up on that endeavour several years ago, but the tooling has probably matured exponentially since then.
As an example, all the academic paper hubs have been using this technology for decades.
I'd wager that all of the big Gen AI companies have planned to use this exact dataset, and many of them probably have already.
> It's trivial to normalise the various formats,
Ha. Ha. ha ha ha.
As someone who has pretty broadly tried to normalize a pile of books and documents I have legitimate access to, no it is not.
You can get good results 80% of the time, usable but messy results 18% of the time, and complete garbage the remaining 2%. More effort seems to only result in marginal improvements.
4 replies →
Decent storage is $10/TB, so for $10,000 you could just keep the entire 1PB of data.
A rather obvious question is if someone has trained an LLM on this archive yet.
A rather obvious answer is Meta is currently being sued for training Llama on Anna's archive.
You can be practically certain that every notable LLM has been trained on it.
1 reply →
They did! They conducted a competition https://annas-archive.org/blog/all-isbns-winners.html , in which a few submissions exceeded the minimum requirements and implemented a good search tool & visualiser.
I think OP was more interested in the ability to text search through the contents. This competition was great and some of the entries were really informative, but none of them included a full text search of the contents of all books.
How is this a text search of the books?
The original question the poster made was not clear, so this is also an answer to it. It depends on what they meant by "search"
You must mean free text search and page level return, because it already has full metadata indexing.
The thing is, AA doesn't hold the texts. Their IPR status is disputed, and even a derived work would be a legal target.
> a derived work would be a legal target.
Why would it? Google isn't prosecuted for indexing the web.
Oh it certainly is. https://www.reuters.com/sustainability/boards-policy-regulat...
2 replies →
There’s an Android app called Openlib. [1]
Description:
Openlib is an open source app to download and read books from shadow library (Anna’s Archive). The App Has Built In Reader to Read Books.
As Anna’s Archive doesn't have an API, the app works by sending requests to Anna’s Archive and parses the response to objects. The app extracts the mirrors from the responses, downloads the book and stores it in the application's document directory.
Note: The app requires a VPN to function properly. Without a VPN, the app might show the captcha-required page even after completing the captcha.
Main Features:
Trending Books
Download And Read Books With In-Built Viewer
Supports Epub And Pdf Formats
Open Books With Your Favourite Ebooks Reader
Filter Books
Sort Books
[1]: https://f-droid.org/de/packages/com.app.openlib/
As far as I know, no one has fully implemented full-text search directly over Anna's Archive. Technically it’s feasible with tools like Meilisearch, Elasticsearch, or Lucene, but the main challenges are storage, converting the many formats to clean plaintext, and the legal exposure.
Z-Library does something similar, but it’s smaller in scope and doesn't integrate AA’s full catalog.
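To give a sense of the shape of it with Elasticsearch (the index name, mapping and documents here are purely illustrative):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # One document per book/article; the body field is what makes the index huge.
    es.indices.create(index="aa-fulltext", mappings={
        "properties": {
            "title": {"type": "text"},
            "body":  {"type": "text"},
            "isbn":  {"type": "keyword"},
        },
    })

    es.index(index="aa-fulltext", document={
        "title": "An example book",
        "body": "…the full plaintext of the book goes here…",
    })

    resp = es.search(index="aa-fulltext", query={"match": {"body": "orbital mechanics"}})
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["title"])

The hard part isn't the indexing API, it's feeding it clean plaintext at that scale and keeping the cluster paid for.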
I’ve done something like this before. Meilisearch will not be viable, because it indexes very slowly and takes up a lot of space.
In my experience only Tantivy can index this much data. Check out Lnx.
Lucene would do fine as well, I guess. As much as I like the author of Tantivy, it is a toy compared to Lucene.
To manage the legal issues, you just have to put AI on the search. "AI search".
[dead]
Z-Library has a keyword search. Personally I didn't find it too useful, especially given Google Books exists. It's not easy to create a quality book search engine.
AFAIK, Z-Library already does this, to some extent. Basic full-text queries do search inside the body of books and articles.
It's a bit smaller than Anna's Archive, as they do host their own collections. From some locations, it's only easy to access through Tor.
Related question: has Anna's archive been thoroughly filtered for non-copyright-related illegal material? Pedo, terrorism, etc. I've considered downloading a few chunks of it, but I'm worried about ending up with content I really don't want to be anywhere near.
The team that curates it is very dedicated and wouldn't do such a thing. Not least because they don't want the heat from it.
I'm not sure what other forms of information are illegal beyond CP. In the US, bomb-making instructions are not illegal. In other dictatorships or zealous religious regimes, information about democracy or works that insult Islam might be illegal.
This is a really strange question, to be honest. You could ask this about literally any download, let alone torrents of documents.
It's the textbook example of the "chilling effect" created by mass surveillance.
Download everything, we know that laws don't apply when you do it on a large enough scale. Not legal advice.
I think you got that wrong. Laws only don't apply if you are large enough. (Like Meta)
How might you inadvertently download illegal content while searching for legal content?
He said he wants to download lots of it in general, not specific items. Legit question whether you'd end up with dark material.
I would assume pedo stuff is not really there, but the Anarchist Cookbook and the like likely will be.
20 replies →
Seeding torrent blocks.
A functional full text search of the shadow libraries would be massive. It would have a comparable impact on humanity to the impact AI will have. And it's probably not difficult technically. Let's start a project to get this done!
Edit: I have had this exact project as my dream for a couple of years, and even experimented a little bit. But I'm not a programmer, so I can only understand theoretically what would be needed for this to work.
Anybody with the same dream, send me an e-mail to booksearch@fastmail.com and let's see what we can do to get the ball rolling!
The indexing costs would be nuts - Anna's Archive is like 200TB+ and growing fast. Even with decent search infra you're looking at serious compute/storage costs. Plus there's the obvious legal stuff that would make this a no-go for most companies with anything to lose. The decentralized thing they're doing probably makes way more sense.
How serious a compute/storage cost? $10,000 per month? $100,000 per month?
[dead]
Probably this was already done at Google, Meta, X and OpenAI, before training their LLMs.
There's actually a section in the Wikipedia page that explicitly says DeepSeek was trained on it.
I have found some search engines, but I do not think they're for Anna's.
https://searchthearxiv.com/
https://refseek.com/
https://arxivxplorer.com/
There is a search solution for zipped fb2 files. Not exactly what you need, but it has potential.
The project has a similar story to Anna's Archive. There is 0.5 TB of archived books, and the project creates an index of all the books with text, title and author search capabilities, and gives an HTML UI for search and reading. On a weak machine it takes about 2 hours to build that index.
So if you have zipped archives of fb2, you can use the project to create web UI with search for those files. Without need of enough space to unpack all the files.
You'll have to translate some Russian though to get instructions on how to set it up.
https://gitlab.com/opennota/fb2index/-/blob/master/README.ru...
But fb2 files are marked-up text, which is (relatively) trivial to index. The bulk of Anna's Archive's books are made from scanned images.
Worth mentioning that 0.5TB is tiny compared to Anna’s, which currently sits around 1.1PB.
Has anyone explored a different angle — like mapping out the 1,000 most frequently mentioned or cited books (across HN, Substack, Twitter, etc.), then turning their raw content into clean, structured data optimized for LLMs? Imagine curating these into thematic shelves — say, “Bill Gates’ Bookshelf” or “HN Canon” — and building an indie portal where anyone can semantically search across these high-signal texts. Kind of like an AI-searchable personal library of the internet’s favorite books.
Well, there's this: https://hacker-recommended-books.vercel.app/category/0/all-t...
small number of people willing to put in significant engineering hours for something that would be illegal and non-monetizable
Facebook said they leeched it, and Anna once mentioned that a few companies, most of them from China, paid for it, so I assume the answer is yes: someone has the data and very likely built the search, but no one will open it given the legal and reputational risk.
Seeing as OpenAI & Co were trained on torrented books from similar places, I'm sure that ChatGPT provides an adequate search layer on top of Anna's Archive, though it is not as free from confabulations as one might hope for in a search engine.
Edit: grammar
Facebook did; its AI is trained on it, so you can use that.
Yes, every major LLM company did it:
illegally using Anna's Archive, The Pile, Common Crawl, their own crawls, Books2, LibGen, etc., embedding it into high-dimensional space and doing next-token prediction on it.
No, because you can't avert the legal issues of doing that.
This works in various search engines
site:annas-archive.org avocado
It's not exactly clear, but OP is asking about indexing the content of all the documents, not the metadata (e.g. titles etc)
[dead]
[dead]
[dead]
https://book-finder.tiiny.site/
More: https://rentry.co/StellaOctangulaIsCool
> https://book-finder.tiiny.site/
That just redirects to https://yandex.com/search
Well yeah, but with a specific query that lets you search multiple libraries.
Don't do it. Just because you can, doesn't mean you should. Do you know if they have anywhere near the legal muscle to push back the flood of legal notices if you did this? Assume it survives because it doesn't have a wide open barn door to the public.
It wouldn’t be called full-text search of AA; it would be called full-text search of every book in the world.
You are asking a judge to consider that a book is ok to scrape because it's part of a much larger collection of books, perhaps the biggest and best collection, and therefore it's all OK because at scale means good.
2 replies →
Mebbe easier to just search Amazon or Goodreads. Like site:amazon.ca <query words> as someone has mentioned below.
Every book has a 10- or 13-digit ISBN to identify it. Unless it's some self-pub/amateur-hour situation by some paranoid prepper living in a faraday cage in Arkansas or Florida, it's likely a publication with a title, an author and an ISBN.
A self-pub amateur-hour book printed by a paranoid prepper living in a faraday cage is exactly the type of book I'd probably enjoy reading, but I doubt these exist anymore.
I know, remember Loompanics?
What about pre-1970 books?