Comment by KellyCriterion

17 hours ago

Scraping is hard. Very good scraping is even harder. And today, being a scraping business is very difficult; there are some "open"/public indices, but none of them ever took off.

Well, sure, I don't dispute that it's hard. But if the top tech companies put their heads together, say Meta, Apple, and MS, they have enough talent between them to build an open source index, if only to reap the gains from the de-monopolization of it all.

  • All these companies have the exact same business model as Google (advertising) and have the same mismatched incentives: good search results are not something they want.

    Google Search sucks not because Google is incapable of filtering out spam and SEO slop (though they very much love that people believe they can't), but because that spam/slop makes the ads on the SERP more enticing, and some of the spam itself carries Google Ads/Analytics and benefits them there too.

    There is no incentive for these companies to build a good search engine by themselves to begin with, let alone provide data to allow others to build one.

    • I was on the Goog forums for years (before they even fucking ruined the FORMAT of the forums, possibly to 'be more mobile friendly'), and it was people absolutely (justifiably) screaming at the product people.

      No, the customer isn't 'always' right, but these guys like to get big, and once they're big it's: fuck you, we don't have to listen to you, we're big; what are you going to do, leave?

  • I mean, doesn't Microsoft have Bing?

    • Yeah, but no one uses it. I'm not even sure the people who are forced to use it like using it, because it was productized pretty poorly. After all, who wants another Google? They invested $100 billion, which is a lot of wasted money TBH.

      Search indexes are hard, surely, but if you stripped it down to just a good index in the browser, made it free, and kept it fresh, it can't cost $100 billion to build. Then you use this DoJ decision to fight Google so it can't deny a free index equal rights on Chrome, and you have a massive shot at a win for a LOT less money.


Scraping is hard and, at the same time, not that hard. There are many scraping projects out there, so with a few lines of code you can implement a scraper using curl_cffi or Playwright.
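A minimal sketch, assuming the curl_cffi package (pip install curl_cffi); its impersonate option mimics a real browser's TLS fingerprint, which gets past many basic bot checks:

```python
# Fetch a page while impersonating Chrome's TLS/HTTP fingerprint.
from curl_cffi import requests

resp = requests.get("https://example.com", impersonate="chrome")
print(resp.status_code)
print(resp.text[:500])  # first 500 characters of the HTML
```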

People complain that the user-agent needs to be filled in. Boo-hoo, are we on Hacker News or what? Can't we just provide cookies and a user-agent? Not a big deal, right?
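A sketch of doing exactly that with plain requests; the user-agent string and cookie values below are placeholders, not anything a particular site requires:

```python
# Supply a user-agent and cookies by hand; values are placeholders.
import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"}
cookies = {"session": "PASTE_YOUR_SESSION_COOKIE_HERE"}

resp = requests.get("https://example.com", headers=headers, cookies=cookies)
print(resp.status_code)
```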

I myself have implemented a simple solution that can jump through many of these hoops and return a JSON response. Simple and easy [0].
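For flavor, a hypothetical sketch of what such a "crawl and return JSON" service could look like. This is not the linked project's actual API, just the general shape, using Flask and curl_cffi:

```python
# Hypothetical crawl-to-JSON endpoint (NOT the crawler-buddy API).
from flask import Flask, request, jsonify
from curl_cffi import requests as curl

app = Flask(__name__)

@app.route("/crawl")
def crawl():
    url = request.args["url"]                   # e.g. /crawl?url=https://example.com
    resp = curl.get(url, impersonate="chrome")  # browser-like fetch
    return jsonify({"url": url, "status": resp.status_code, "text": resp.text})

if __name__ == "__main__":
    app.run(port=8000)
```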

On the other hand, it has always been an arms race, and it always will be. Eventually all content will be locked behind walled gardens; there is no way around it.

Search engines matter less and less to me every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, and YouTube channels [1].

Since the database is so big, the places I use most are extracted into a simple, fast web page backed by an SQLite table [2]. Scraping done right is not a problem.
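A sketch of that extraction step, assuming a hypothetical places table with a visit counter (the schema here is illustrative, not the linked repo's):

```python
# Dump the most-visited entries from a big SQLite bookmarks database
# into a small static HTML page. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect("places.db")
rows = con.execute(
    "SELECT title, url FROM places ORDER BY visits DESC LIMIT 200"
).fetchall()

with open("index.html", "w", encoding="utf-8") as f:
    f.write("<ul>\n")
    for title, url in rows:
        f.write(f'<li><a href="{url}">{title}</a></li>\n')
    f.write("</ul>\n")
```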

[0] https://github.com/rumca-js/crawler-buddy

[1] https://github.com/rumca-js/Internet-Places-Database

[2] https://rumca-js.github.io/search

  • > Search engines matter less and less to me every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, and YouTube channels

    Exactly. Why can't we just hoard our bookmarks and a list of curated sources, say 1M or 10M small search stubs, and have an LLM direct the scraping operation?

    The idea is to have starting points for a scraper: blogs, awesome lists, specialized search engines, news sites, docs, etc. For a given query the model only needs a few starting points to find fresh information. Hosting a few GB of compact search stubs could go a long way towards search independence.

    This could mean replacing Google. You could even go fully local: local LLM + code sandbox + search stub index + scraper.
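    A rough sketch of what that loop could look like; llm() is a stand-in for whatever local model call you use, and the stubs full-text table is hypothetical:

    ```python
    # Hypothetical LLM-directed scraping loop over a local stub index.
    import sqlite3
    from curl_cffi import requests as curl

    def candidate_stubs(query, db="stubs.db", limit=20):
        con = sqlite3.connect(db)
        # "stubs" is assumed to be an FTS5 virtual table of url + summary.
        return con.execute(
            "SELECT url, summary FROM stubs WHERE stubs MATCH ? LIMIT ?",
            (query, limit),
        ).fetchall()

    def llm(prompt):
        raise NotImplementedError("plug in your local model call here")

    def answer(query):
        stubs = candidate_stubs(query)
        listing = "\n".join(f"{u} :: {s}" for u, s in stubs)
        # The model picks a few starting points; the scraper fetches them fresh.
        urls = llm(f"Query: {query}\nPick up to 3 URLs:\n{listing}").split()
        pages = [curl.get(u, impersonate="chrome").text for u in urls]
        return llm(f"Answer '{query}' using:\n" + "\n\n".join(p[:4000] for p in pages))
    ```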

  • +1 so much for this. I have been doing the same: an SQLite database of my "own personal internet" of the sites I actually need. I use it as a tiny supplementary index for a metasearch engine I built for myself, which I actually did to replace Kagi.

    Building a metasearch engine is not hard to do (especially with AI now). It's liberating when you control the ranking algorithm and can supplement the big engines' results with your own index of sites and pages that matter to you. I admit my results and speed aren't as good as Kagi's, but they're good enough that my personal search engine has been my sole search engine for a year now.
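    One simple way to do that merge-and-rank step, purely as an illustration: reciprocal rank fusion over the engines' result lists, with a boost for domains from your own index. Everything below is illustrative:

    ```python
    # Merge ranked URL lists from several engines via reciprocal rank
    # fusion, boosting domains from a personal index.
    from urllib.parse import urlparse

    def merge(result_lists, personal_domains, k=60, boost=0.05):
        scores = {}
        for results in result_lists:          # one ranked URL list per engine
            for rank, url in enumerate(results):
                scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank + 1)
        for url in scores:
            if urlparse(url).netloc in personal_domains:
                scores[url] += boost          # favor sites you already trust
        return sorted(scores, key=scores.get, reverse=True)

    merged = merge(
        [["https://a.example/x", "https://b.example/y"],
         ["https://b.example/y", "https://c.example/z"]],
        personal_domains={"b.example"},
    )
    print(merged)  # b.example/y first: two engines agree and it's boosted
    ```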

    If a site doesn't want me to crawl it, that's fine; I probably don't need it. In practice this hasn't gotten in the way as much as I thought it might. But I do still rely on Brave / Mojeek / Marginalia to do much of the heavy lifting for me.

    I especially appreciate Marginalia for publicly documenting as much about building a search engine as they have: https://www.marginalia.nu/log/