Comment by serial_dev
3 days ago
The main barriers for me would be:
1. Why? Who would use that? What’s the problem with the other search engines? How will it be paid for?
2. Potential legal issues.
The technical barriers are at least challenging and interesting.
Providing a service that needs significant upfront investment, with no product or service vision, that I’ll likely be sued over a couple of times a year, probably losing, with who knows what kind of punishment… I’ll have to pass, unfortunately.
1. It'd be for the scientific community (broadly-construed). Converting media that is currently completely un-indexed into plaintext and offering a suite of search features for finding content within it would be a game-changer, IMO! If you've ever done a lit review for any field other than ML, I'm guessing you know how reliant many fields are on relatively-old books and articles (read: PDFs at best, paper-only at worst) that you can basically only encounter via a) citation chains, b) following an author, or c) encyclopedias/textbooks.
2. I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it. GoodReads doesn't need legal permission to index popular books, for example.
In general I get the sense that your comment is written from the perspective of an entrepreneur/startup mindset. I'm sure that's brought you meaning and maybe even some wealth, but it's not a universal one! Some of us are more interested in making something to advance humanity than something likely to make a profit, even if we might look silly in the process.
> I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it.
You don't need to host copyrighted material. It's all about intent. The Pirate Bay is (imo correctly, even if I disagree with other aspects of copyright law and its enforcement) seen as a place where people go to find ways to not pay authors for their content. They never hosted a copyrighted byte but they're banned in some form (DNS, IP, domain seizures) in many countries. Proxies of TPB are blocked as well, so being like an ISP for such a site is already enough, whereas nobody is ordering blocks of Comcast's IP addresses for providing access to websites with copyrighted material, because Comcast doesn't have a somewhat-provable intent to facilitate copyright infringement.
When I read the OP, I imagine this would link from the search results directly to Anna's archive and sci-hub, but I think you'd have to spin it as a general purpose search page and ideally not even mention AA was one of the sources, much less have links
(Don't get me wrong: everyone wants this except the lobby of journals that presently own the rights)
It would be a real shame if an anonymous third party that's definitely not the website operator made a Firefox add-on that illegitimately inserts these links to search results page though
> When I read the OP, I imagine this would link from the search results directly to Anna's archive and sci-hub
You could just give users ISBNs or link to the book's metadata on openlibrary[0], both of which AA's native search already does.
[0] https://openlibrary.org/
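A rough sketch of the ISBN route, to make it concrete: validate an ISBN-13 checksum and build a metadata link, never a link to the content itself. The function names are made up for illustration, and the `/isbn/<number>` path is my assumption about Open Library's public URL scheme, so check it before relying on it.

```python
def is_valid_isbn13(isbn: str) -> bool:
    """Return True if the string is an ISBN-13 with a correct check digit."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # ISBN-13 weights alternate 1, 3, 1, 3, ... across all 13 digits;
    # the weighted sum must be divisible by 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def openlibrary_isbn_url(isbn: str) -> str:
    """Build a metadata link for a search result (hypothetical helper).

    Assumes Open Library resolves https://openlibrary.org/isbn/<isbn>.
    """
    digits = isbn.replace("-", "").replace(" ", "")
    if not is_valid_isbn13(digits):
        raise ValueError(f"not a valid ISBN-13: {isbn!r}")
    return f"https://openlibrary.org/isbn/{digits}"


if __name__ == "__main__":
    print(openlibrary_isbn_url("978-0-306-40615-7"))
```

The search results page would then only ever show identifiers and metadata links, which is the point of the suggestion.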
Yeah, but how does the search work? Does it show a portion of the text? If it does, isn't that also a part of the book?
But he did not mention anything about creating a "service"
It could be his own copy for personal use
What if computers continue to become faster and storage continues to become cheaper? What if "large" amounts of data continue to become more manageable?
The data might seem large today, but it might not seem large or unmanageable in the future
It would be incredible for LLMs. Searching it, using it as training data, etc. Would probably have to be done in Russia or some other country that doesn't respect international copyright though.
Do you have a reason to believe this ain't already being done? I would assume that the big guys like OpenAI are already training on basically all text in existence.
In fact, Facebook torrented Anna's Archive and got busted for it, because of course they did:
https://torrentfreak.com/meta-torrented-over-81-tb-of-data-t...
Wasn't it confirmed that Meta does this?
https://www.forbes.com/sites/danpontefract/2025/03/25/author...
> Would probably have to be done in Russia or some other country that doesn't respect international copyright though.
Incredible, several years of major American AI companies showing that flouting copyright only matters if it's college kids torrenting shows or enthusiasts archiving bootlegs on whatcd, but if it's big corpos doing it it's necessary for innovation.
Yet some people still believe "it would have to be done in evil Russia".
OP does have an exaggerated statement - it's not like there aren't laws in Russia or something and I largely agree with your sentiment. I think there are levels to this though and it's pretty clear that Russia is much riskier than the USA when it comes to IP - just look up anything to do with insuring IP risk in Russia (here's one such example: https://baa.no/en/articles/i-have-ip-in-russia-is-my-ip-at-r...)
Also, according to the Office of the US Trade Representative, Russia is on the priority watch list of countries that do not respect IP [1], and post 2022, largely due to the war, Russia implemented measures negatively affecting IP rights. [2,3]
If you think it isn't the case and Russia is just as risky as the US when it comes to copyright and IP, I would be interested to know why.
1. https://ustr.gov/about/policy-offices/press-office/press-rel...
2. https://www.papula-nevinpat.com/executive-summary-the-ip-sit...
3. https://www.taftlaw.com/news-events/law-bulletins/russia-iss...
> evil
In this case and context, a label like "evil" is a twisted interpretation.
> or some other country that doesn't respect international copyright though.
Like the US? OpenAI et al. don't give a shit.
There's a difference between feeding massive amounts of copyrighted material to a training process that blends them thoroughly and irreversibly, and doing all that in-house, vs. offering people a service that indexes (and possibly partially rehosts) that material, enabling and encouraging users to engage directly in pirating concrete copyrighted works.
> > or some other country that doesn't respect international copyright though.
> Like the US? OpenAI et al. don't give a shit.
OpenAI is not a country and therefore cannot make laws that don't respect international (or domestic) copyright. Also the US is a lot bigger than OpenAI and the big tech corps, and the law is very much on the side of copyright holders in the US.
LLMs already use it, dude )
I think one use would be to search for information directly from a book, rather than get a garbled/half-hallucinated version of it.
> 1. Why? Who would use that?
Rather, who would use a traditional search engine instead of a book search engine, when the quality of the results from the latter will be much superior?
People who need or want the highest quality information available will pay for it. I'd easily pay for it.