Comment by zX41ZdbW
3 years ago
It sounds like CommonCrawl: https://commoncrawl.org/the-data/get-started/
You can download it, put it into ClickHouse, and get your own professional search engine.
I've made up the term "professional search engine". It's something like Google, but:
- accessible by a few people, not publicly available;
- does not have sophisticated ranking or quorum pruning and simply gives you all the matched results;
- queries can be performed in SQL, and the results can be additionally aggregated and analyzed;
- full brute-force search is feasible.
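To make the idea concrete, here is a minimal sketch of "unranked brute-force search plus aggregation in SQL". It is only an illustration: it uses Python's built-in sqlite3 as a stand-in for ClickHouse so that it runs anywhere, and the `pages` table, its columns, and the sample rows are all made up for the example, not a real CommonCrawl schema.

```python
import sqlite3

# Hypothetical toy corpus standing in for crawled pages (not real CommonCrawl data).
PAGES = [
    ("example.com", "https://example.com/a", "ClickHouse is a column-oriented database"),
    ("example.com", "https://example.com/b", "A post about search engines"),
    ("example.org", "https://example.org/x", "Loading CommonCrawl into ClickHouse"),
    ("example.net", "https://example.net/y", "Unrelated text about cooking"),
]


def build_index(rows):
    """Load rows into an in-memory SQLite table (stand-in for a ClickHouse table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (domain TEXT, url TEXT, body TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)
    return conn


def search(conn, term):
    """Brute-force scan: no ranking, no pruning, every matching row is returned."""
    return conn.execute(
        "SELECT url FROM pages "
        "WHERE instr(lower(body), lower(?)) > 0 "
        "ORDER BY url",
        (term,),
    ).fetchall()


def matches_per_domain(conn, term):
    """Aggregate the same brute-force match set: count hits per domain."""
    return conn.execute(
        "SELECT domain, count(*) AS hits FROM pages "
        "WHERE instr(lower(body), lower(?)) > 0 "
        "GROUP BY domain ORDER BY hits DESC, domain",
        (term,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_index(PAGES)
    print(search(conn, "clickhouse"))
    print(matches_per_domain(conn, "clickhouse"))
```

On a real dataset the same two query shapes (a substring filter over every row, then a GROUP BY over the matches) would be run in ClickHouse itself; the SQLite version only shows the pattern.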
PS. Yes, the Reddit dataset stopped updating.