Comment by swatcoder
3 years ago
You’re seeing the symptom of something deeper.
Conventional search was a ~20-year solution to navigating the “entirety” of online content while the available content was within the scope of that innovation. That era is coming to an end. There’s just too much content to index literally, and too much noise to quantify quality, and that problem is getting worse much faster than crawl+search technology can scale.
So new techniques for navigating content are emerging, some of them calling back to pre-search solutions.
LLM chat assistants drop the literal reference requirement by just mushing up all the sources they can and hallucinating something vaguely relevant to incoming questions. They lean into the noise and try to find patterns in it rather than sources.
Meanwhile, “walled garden” private communities like Discord, Slack, Whatsapp/iMessage, and the growing list of login-required social content sites commit to sharing literal source content but address the noise problem by regimenting and moderating how content is incorporated.
There will almost certainly be a next-generation “meta-search” that can help you frame and make queries across these walled gardens, but it’s going to take a long while for the infrastructure and business models around that to establish themselves.
In the meantime, this is what we get and what we can expect for a while.
This is correct. Google became a monopoly and stopped caring about surfacing any results it couldn't immediately monetize. If your content is archived in a forum somewhere, Google won't find it anyway; it'll instead show you results from YouTube, ads, and whatever is on top of their cache. Search is effectively dead, so we have to resort to asking other humans directly for answers. Which sucks, but that's the phase of the competition/monopoly cycle that we're in right now.
Ignoring the problem of robots.txt inaccessibility, is it feasible to have a Kagi-style "private Google" with a more limited number of high-signal-to-noise sites, especially if you drop e-commerce and some other low-SNR feeds?
Perhaps one interesting thing is that a decent number of the highest-SNR feeds don't actually need to be crawled at all: Wikipedia, Reddit, etc. are available as dumps and you can ingest their content directly. And the sources I'm most interested in for my hobbies (technical data around cameras, computer parts, aircraft, etc.) tend to be mostly static web-1.0 sites that basically never change. There's some stuff that falls in between; I'm not sure if random other wikis necessarily have takeout dumps, but again, fandom-wiki and a couple of other mega-wikis probably contain a majority of the interesting content, or at least a large enough amount that you could get meaningful results.
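For what it's worth, ingesting those dumps is pretty mechanical. A minimal sketch, assuming a Pushshift-style zstd-compressed NDJSON Reddit dump (the filename and field names here are just examples):

```python
# Minimal sketch: stream a Pushshift-style Reddit dump (zstd-compressed,
# newline-delimited JSON) straight into whatever you're indexing; no crawling.
import io
import json
import zstandard as zstd

def iter_dump(path):
    """Yield one JSON object per line from a .zst NDJSON dump."""
    with open(path, "rb") as fh:
        # These dumps were compressed with a large window; allow it explicitly.
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8", errors="replace"):
            if line.strip():
                yield json.loads(line)

if __name__ == "__main__":
    for post in iter_dump("RS_2019-06.zst"):  # hypothetical dump file
        print(post.get("subreddit"), post.get("title"))
```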
Another interesting one would be if you could get the Internet Archive to give you "slices" of sites in a Google Takeout-style format. They've already scraped a great deal of content, so if I want site X and the most recent non-404 versions of all pages in a given domain, it would be fantastic if they could just build that as a zip and dump it over in bulk. In fact, a lot of the best technical content is no longer available on the live web, unfortunately...
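As far as I know they don't offer Takeout-style bulk zips, but their CDX API gets you part of the way there. A rough sketch (the domain is a placeholder, and picking strictly the newest capture per URL may need extra filtering):

```python
# Rough sketch of an "archive slice" via the Wayback CDX API: list one
# 200-status capture per URL under a domain, then fetch the raw archived bytes.
import requests

DOMAIN = "example.com"  # hypothetical target domain

rows = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": DOMAIN,
        "matchType": "domain",       # include subdomains
        "filter": "statuscode:200",  # skip 404s and redirects
        "collapse": "urlkey",        # one capture per unique URL
        "fl": "timestamp,original",
        "output": "json",
    },
    timeout=60,
).json()

for timestamp, original in rows[1:]:  # first row is the header
    # The "id_" modifier asks for the capture without Wayback's toolbar/rewriting.
    snapshot = requests.get(
        f"https://web.archive.org/web/{timestamp}id_/{original}", timeout=60
    )
    # ... write snapshot.content into your local hoard here
```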
(Did fh-reddit ever update again? Or is there a way to get Pushshift to give you a bulk dump of everything? They stopped back in like 2019 and I'm not sure if they ever got back into it; it wasn't on BigQuery last time I checked. Kind of a bummer too.)
I say exclude e-commerce because there's not a lot of informational value in knowing the 27 sites selling a video card (especially as a few mega-retailers crush all the competition anyway), but there is lots of informational value in, say, having a copy of the sites of Asus, ASRock, Gigabyte, MSI, etc. for searching (you probably don't want full binaries cached, though).
But basically I think there's probably sub-100 TB of content that would even be useful to me if stored in some kind of relatively dense representation (Reddit post/comment dumps rather than rendered pages, same for other forum content, etc., stored on a gzip level-5 filesystem or something). That's easily within reach of a small server. I'm not sure if PageRank would work as well without all the “noise” linking into it and telling you where the signal is, but I think that's well within typical r/datahoarder-level builds. And you could dynamically augment that from the live internet and the Internet Archive as needed: just treat it as an ever-growing cache and index your hoard.
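The cache-and-augment part is simple enough to sketch: check the local hoard, fall back to the live web, and fall back again to the Wayback Machine's availability API. The paths and the exact flow are just an assumption of how you'd wire it:

```python
# Sketch of the "ever-growing cache": local hoard first, then the live web,
# then the Internet Archive's availability API for dead pages.
import hashlib
import pathlib
import requests

HOARD = pathlib.Path("hoard")
HOARD.mkdir(exist_ok=True)

def fetch(url: str) -> bytes:
    key = HOARD / hashlib.sha256(url.encode()).hexdigest()
    if key.exists():
        return key.read_bytes()                      # already hoarded
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        body = resp.content
    except requests.RequestException:
        # Dead on the live web: ask the Wayback Machine for its closest snapshot.
        avail = requests.get("https://archive.org/wayback/available",
                             params={"url": url}, timeout=30).json()
        closest = avail.get("archived_snapshots", {}).get("closest")
        if not closest:
            raise
        body = requests.get(closest["url"], timeout=30).content
    key.write_bytes(body)                            # grow the hoard for indexing later
    return body
```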
It sounds like CommonCrawl: https://commoncrawl.org/the-data/get-started/
You can download it, put it into ClickHouse, and get your own professional search engine.
I've made up the term "professional search engine". It's something like Google, but:
- accessible by a few people, not publicly available;
- does not have sophisticated ranking or quorum pruning and simply gives you all the matched results;
- queries can be performed in SQL, and the results additionally aggregated and analyzed;
- full brute-force search is feasible.
PS. Yes, the Reddit dataset stopped updating.
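For a feel of what those SQL queries look like, here's a minimal sketch against ClickHouse's HTTP interface; the table name and columns are assumptions about how you loaded the crawl:

```python
# Brute-force SQL over a Common Crawl-style table via ClickHouse's HTTP
# interface (default port 8123). Table/columns are hypothetical.
import requests

SQL = """
SELECT url, title
FROM crawl_pages
WHERE positionCaseInsensitive(body, 'asrock x570 bios') > 0
ORDER BY fetch_date DESC
LIMIT 100
FORMAT TSV
"""

resp = requests.post("http://localhost:8123/", data=SQL, timeout=300)
resp.raise_for_status()
for line in resp.text.splitlines():
    url, title = line.split("\t")
    print(url, title)
```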
With a high-enough-quality source corpus, you probably don't even need PageRank; regular full-text search like Lucene would be enough.
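Not Lucene, but as a rough illustration of the same idea, SQLite's FTS5 with its built-in BM25 ranking gets you surprisingly far (assuming your sqlite3 build includes FTS5; the documents here are made up):

```python
# Tiny illustration of "full-text search is enough": index documents with
# SQLite FTS5 and rank by bm25 -- no link graph needed.
import sqlite3

con = sqlite3.connect("hoard.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(url, title, body)")
con.executemany(
    "INSERT INTO docs (url, title, body) VALUES (?, ?, ?)",
    [
        ("https://example.com/x570", "X570 BIOS notes", "AGESA versions and flashing steps ..."),
        ("https://example.com/lens", "Lens test data", "MTF charts for various primes ..."),
    ],  # hypothetical documents
)
con.commit()

# bm25() returns lower values for better matches, so ascending order is best-first.
for url, title in con.execute(
    "SELECT url, title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT 10",
    ("bios flashing",),
):
    print(url, title)
```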
Sounds crazy hard, with a lot of moving parts, both human and technical. But if pulled off right, I'd pay for that kind of thing, preferably hosted and looked after for me in a datacenter.
I'd like a browser plugin which lets me vote websites up or down. The votes would go to a central community database. Highly upvoted sites would be crawled and archived, and made available through a specialized search page.
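The server side of that could start out tiny. A hypothetical sketch of the central database (endpoint names and the vote schema are made up):

```python
# Sketch of a community vote store: one endpoint records votes, another
# yields the highly-upvoted domains worth crawling.
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)
db = sqlite3.connect("votes.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS votes (domain TEXT, user_id TEXT, vote INTEGER)")

@app.post("/vote")
def vote():
    payload = request.get_json()
    db.execute("INSERT INTO votes VALUES (?, ?, ?)",
               (payload["domain"], payload["user_id"], int(payload["vote"])))
    db.commit()
    return jsonify(ok=True)

@app.get("/crawl-queue")
def crawl_queue():
    rows = db.execute("""SELECT domain, SUM(vote) AS score FROM votes
                         GROUP BY domain HAVING SUM(vote) > 10
                         ORDER BY score DESC""").fetchall()
    return jsonify([{"domain": d, "score": s} for d, s in rows])
```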
I don't know about YouTube, but DuckDuckGo showed me a forum among the top results just two days ago?