Comment by KellyCriterion

17 hours ago

Scraping is hard. Very good scraping is even harder. And today, being a scraping business is very difficult; there are some "open"/public indices, but none of them ever took off.

Well, sure, I don't dispute that it's hard. But if the top tech companies put their heads together, say Meta, Apple, and MS, they have enough talent between them to build an open source index, if only to reap the gains from the de-monopolization of it all.

  • All these companies have the exact same business model as Google (advertising) and have the same mismatched incentives: good search results are not something they want.

    Google Search sucks not because Google is incapable of filtering out spam and SEO slop (though they very much love that people believe they can't), but because that spam/slop makes the ads on the SERP more enticing, and some of the spam itself carries Google Ads/Analytics and benefits them there too.

    There is no incentive for these companies to build a good search engine by themselves to begin with, let alone provide data to allow others to build one.

    • I was on the Goog forums for years (before they even fucking ruined the FORMAT of the forums, possibly to 'be more mobile friendly'), and it was people absolutely (justifiably) screaming at the product people.

      No, the customer isn't 'always' right, but these guys like to get big, and once they're big it's: fuck you, we don't have to listen to you, we're big; what are you going to do, leave?

  • I mean, doesn't Microsoft have Bing?

    • Yeah, but no one uses it. I'm not even sure the people who are forced to use it like using it, because it was productized pretty poorly. After all, who wants another Google? They invested $100 billion, which is a lot of wasted money TBH.

      Search indexes are hard, surely, but if you stripped it down to just a good index in the browser, made it free, and kept it fresh, it can't cost $100 billion to build. Then you use this DoJ decision to fight Google so it can't deny a free index equal rights on Chrome, and you have a massive shot at a win for a LOT less money.


Scraping is hard and, at the same time, not that hard. There are many scraping projects out there, so with a few lines of code you can implement a scraper using curl_cffi or Playwright.
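A minimal sketch, assuming the curl_cffi package (pip install curl_cffi); its impersonate option mimics a real browser's TLS fingerprint, which gets past many basic bot checks:

```python
# Fetch a page while impersonating Chrome's TLS/HTTP fingerprint.
from curl_cffi import requests

resp = requests.get("https://example.com", impersonate="chrome")
print(resp.status_code)
print(resp.text[:500])  # first 500 characters of the HTML
```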

People complain that the user-agent needs to be filled in. Boo-hoo, are we on Hacker News or what? Can't we just provide cookies and a user-agent? Not a big deal, right?
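A sketch of doing exactly that with plain requests; the user-agent string and cookie values below are placeholders, not anything a particular site requires:

```python
# Supply a user-agent and cookies by hand; values are placeholders.
import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"}
cookies = {"session": "PASTE_YOUR_SESSION_COOKIE_HERE"}

resp = requests.get("https://example.com", headers=headers, cookies=cookies)
print(resp.status_code)
```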

I myself have implemented a simple solution that can jump through many of these hoops and return a JSON response. Simple and easy [0].
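For flavor, a hypothetical sketch of what such a "crawl and return JSON" service could look like. This is not the linked project's actual API, just the general shape, using Flask and curl_cffi:

```python
# Hypothetical crawl-to-JSON endpoint (NOT the crawler-buddy API).
from flask import Flask, request, jsonify
from curl_cffi import requests as curl

app = Flask(__name__)

@app.route("/crawl")
def crawl():
    url = request.args["url"]                   # e.g. /crawl?url=https://example.com
    resp = curl.get(url, impersonate="chrome")  # browser-like fetch
    return jsonify({"url": url, "status": resp.status_code, "text": resp.text})

if __name__ == "__main__":
    app.run(port=8000)
```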

On the other hand, it has always been an arms race, and it always will be. Eventually all content will be locked behind walled gardens; there is no way around it.

Search engines matter less and less to me every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, and YouTube channels [1].

Since the database is so big, the places I use most are extracted into a simple, fast web page backed by an SQLite table [2]. Scraping done right is not a problem.
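A sketch of that extraction step, assuming a hypothetical places table with a visit counter (the schema here is illustrative, not the linked repo's):

```python
# Dump the most-visited entries from a big SQLite bookmarks database
# into a small static HTML page. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect("places.db")
rows = con.execute(
    "SELECT title, url FROM places ORDER BY visits DESC LIMIT 200"
).fetchall()

with open("index.html", "w", encoding="utf-8") as f:
    f.write("<ul>\n")
    for title, url in rows:
        f.write(f'<li><a href="{url}">{title}</a></li>\n')
    f.write("</ul>\n")
```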

[0] https://github.com/rumca-js/crawler-buddy

[1] https://github.com/rumca-js/Internet-Places-Database

[2] https://rumca-js.github.io/search

  • > Search engines matter less and less to me every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, and YouTube channels

    Exactly. Why can't we just hoard our bookmarks and a list of curated sources, say 1M or 10M small search stubs, and have an LLM direct the scraping operation?

    The idea is to have starting points for a scraper: blogs, awesome lists, specialized search engines, news sites, docs, etc. For a given query the model only needs a few starting points to find fresh information. Hosting a few GB of compact search stubs could go a long way towards search independence.

    This could mean replacing Google. You could even go fully local: local LLM + code sandbox + search stub index + scraper.
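    A rough sketch of what that loop could look like; llm() is a stand-in for whatever local model call you use, and the stubs full-text table is hypothetical:

    ```python
    # Hypothetical LLM-directed scraping loop over a local stub index.
    import sqlite3
    from curl_cffi import requests as curl

    def candidate_stubs(query, db="stubs.db", limit=20):
        con = sqlite3.connect(db)
        # "stubs" is assumed to be an FTS5 virtual table of url + summary.
        return con.execute(
            "SELECT url, summary FROM stubs WHERE stubs MATCH ? LIMIT ?",
            (query, limit),
        ).fetchall()

    def llm(prompt):
        raise NotImplementedError("plug in your local model call here")

    def answer(query):
        stubs = candidate_stubs(query)
        listing = "\n".join(f"{u} :: {s}" for u, s in stubs)
        # The model picks a few starting points; the scraper fetches them fresh.
        urls = llm(f"Query: {query}\nPick up to 3 URLs:\n{listing}").split()
        pages = [curl.get(u, impersonate="chrome").text for u in urls]
        return llm(f"Answer '{query}' using:\n" + "\n\n".join(p[:4000] for p in pages))
    ```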

  • +1 so much for this. I have been doing the same: an SQLite database of my "own personal internet" of the sites I actually need. I use it as a tiny supplementary index for a metasearch engine I built for myself, which I actually did to replace Kagi.

    Building a metasearch engine is not hard to do (especially with AI now). It's liberating when you control the ranking algorithm and can supplement the big engines' results with your own index of sites and pages that matter to you. I admit my results and speed aren't as good as Kagi's, but they're good enough that my personal search engine has been my sole search engine for a year now.
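    One simple way to do that merge-and-rank step, purely as an illustration: reciprocal rank fusion over the engines' result lists, with a boost for domains from your own index. Everything below is illustrative:

    ```python
    # Merge ranked URL lists from several engines via reciprocal rank
    # fusion, boosting domains from a personal index.
    from urllib.parse import urlparse

    def merge(result_lists, personal_domains, k=60, boost=0.05):
        scores = {}
        for results in result_lists:          # one ranked URL list per engine
            for rank, url in enumerate(results):
                scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank + 1)
        for url in scores:
            if urlparse(url).netloc in personal_domains:
                scores[url] += boost          # favor sites you already trust
        return sorted(scores, key=scores.get, reverse=True)

    merged = merge(
        [["https://a.example/x", "https://b.example/y"],
         ["https://b.example/y", "https://c.example/z"]],
        personal_domains={"b.example"},
    )
    print(merged)  # b.example/y first: two engines agree and it's boosted
    ```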

    If a site doesn't want me to crawl it, that's fine; I probably don't need it. In practice this hasn't gotten in the way as much as I thought it might. But I do still rely on Brave / Mojeek / Marginalia to do much of the heavy lifting for me.

    I especially appreciate Marginalia for publicly documenting as much about building a search engine as they have: https://www.marginalia.nu/log/