Comment by divineg
7 months ago
It's incredible. I can't believe it but it actually works quite nicely.
If 10K $5 subscriptions can cover its cost, maybe a community run search engine funded through donations isn't that insane?
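As a back-of-envelope check of that claim (the subscriber count and price are the commenter's hypothetical, not real figures):

```python
# Rough revenue estimate for the hypothetical "10K subscribers at $5/month".
subscribers = 10_000
price_per_month = 5  # USD, as suggested in the comment above

monthly_revenue = subscribers * price_per_month
annual_revenue = monthly_revenue * 12

print(monthly_revenue)  # 50000
print(annual_revenue)   # 600000
```

So the premise is roughly $50K/month ($600K/year) covering infrastructure, which is the scale being debated.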
It's been clear to anyone familiar with encoder-only LLMs that Google is effectively dead. The only reason why it still lives is that it takes a while to crawl the whole web and keep the index up to date.
If someone like common crawl, or even a paid service, solves the crawling of the web in real time then the moat Google had for the last 25 years is dead and search is commoditized.
The team that runs the Common Crawl Foundation is well aware of how to crawl and index the web in real time. It's expensive, and it's not our mission. There are multiple companies that are using our crawl data and our web graph metadata to build up-to-date indexes of the web.
Yes, I've used your data myself on a number of occasions.
But you are pretty much the only people who can save the web from AI bots right now.
The sites I administer are drowning in bots, and the applications I build which need web data are constantly blocked. We're in the worst of all possible worlds and the simplest way to solve it is to have a middleman that scrapes gently and has the bandwidth to provide an AI first API.
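A "middleman that scrapes gently" mostly means honoring robots.txt and rate limits. Here's a minimal sketch using Python's stdlib `urllib.robotparser`; the function names and the default delay are illustrative assumptions, not any existing service's API:

```python
import time
from urllib.robotparser import RobotFileParser

def make_polite_fetcher(robots_txt: str, user_agent: str,
                        default_delay: float = 5.0):
    """Return (allowed, wait_turn) helpers for one host.

    Hypothetical sketch: a real crawler would fetch robots.txt per host,
    cache it, and interleave requests across many hosts.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    # Honor the site's Crawl-delay if given, else fall back to our default.
    delay = rp.crawl_delay(user_agent) or default_delay
    last = {"t": 0.0}

    def allowed(url: str) -> bool:
        # Check robots.txt rules before every request.
        return rp.can_fetch(user_agent, url)

    def wait_turn():
        # Sleep just long enough to keep at least `delay` seconds
        # between consecutive requests to this host.
        elapsed = time.monotonic() - last["t"]
        if elapsed < delay:
            time.sleep(delay - elapsed)
        last["t"] = time.monotonic()

    return allowed, wait_turn

robots = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
allowed, wait_turn = make_polite_fetcher(robots, "GentleBot")
print(allowed("https://example.com/public/page"))   # True
print(allowed("https://example.com/private/data"))  # False
```

Per-host politeness like this is exactly what individual AI bots skip, and what a shared crawler can amortize once for everyone.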
Your terms and conditions include a lot of restrictions, some of which are ambiguous in how they can be interpreted.
Would Common Crawl offer a "for all purposes, no restrictions" license for AI training, computer analyses, etc.? Especially given that the bad actors are ignoring copyrights and terms anyway, such restrictions only affect moral, law-abiding people.
Also, even simpler, would Common Crawl release under a permissive license a list of URLs that others could scrape themselves? Maybe with metadata per URL from your crawls, such as which sites use Cloudflare or other limiters. Being able to rescrape the CC index independently would be very helpful under some legal theories about AI training. Independent search operators would benefit, too.
It's not dead but will take a huge hit. I still use DuckDuckGo since I get good answers, good discovery, taken right to the sources (whom I can cite), and the search indexes are legal vs all the copyright infringement in AI training.
If AI training becomes totally legal, I will definitely start using them more in place of or to supplement search. Right now, I don't even use the AI answers.
You can see their panic - in my country they are running TV ads for Google search, showing it answering LLM-prompt-like queries. They are desperately trying to win back that mind share, and if they lose traditional keyword search too they’re cooked
Which country is that?
Kagi seems to partially be that. Yes, pretty corpo, but way better vibes than Google. SearXNG is a bit different but also a thing.
I think even more spectacularly, we may be witnessing the feature to feature obsolescence of big tech.
Models make it cheap to replicate and perform what tech companies do. Their insurmountable moats are lowering as we speak.
yep, seems the big guys are running out of ideas, to some degree.