Comment by awendland

1 year ago

Following @isoprohplex, I'll be the fourth comment to say I also built a variant of this: https://hnss.alexwendland.com/

I built mine on top of an RSS feed I generate from Hacker News, which filters out any posts linking to the top 1 million domains [1] and creates a readable version of the content. I use it to surface articles on smaller blogs/personal websites; it's become my main content source. It's generated via GitHub Actions every 4 hours and stored in a detached branch on GitHub (~2 GB of data from the past 4 years). Here's an example for posts with >= 10 upvotes [2].
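
The domain filter described above could be sketched roughly as follows. This is an illustration, not the feed generator's actual code: `is_small_site`, `filter_posts`, the post dict shape (`url`, `points`), and the exact-hostname matching rule are all assumptions.

```python
from urllib.parse import urlparse


def normalized_host(url):
    """Lowercased hostname with a leading 'www.' stripped (a naive normalization)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_small_site(url, top_domains):
    """True when the URL's host is not in the top-domains set.

    Assumption: exact hostname match only. Subdomain handling
    (e.g. user pages on a large hosting domain) is left out of this sketch.
    """
    host = normalized_host(url)
    return bool(host) and host not in top_domains


def filter_posts(posts, top_domains, min_points=0):
    """Keep posts that meet the upvote threshold and link to a small site."""
    return [
        p for p in posts
        if p["points"] >= min_points and is_small_site(p["url"], top_domains)
    ]
```

In practice `top_domains` would be loaded once from a list such as the one in [1] and kept as a set so each lookup is constant time.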

It only took several hours to build the semantic search on top. And that included time for me to try out and learn several different vector DBs, embedding models, data pipelines, and UI frameworks! The current state of AI tooling is wonderfully simple.

In the end I landed on the following (selected in haste while optimizing for developer ergonomics, so only a partial endorsement):

  - BAAI/bge-small-en as an embedding model
  - Python with
    - HuggingFaceBgeEmbeddings from langchain_community for creating embeddings
    - SentenceSplitter from llama_index for chunking documents
    - ChromaDB as a vector DB + chroma-ops to prune the DB
    - sqlite3 for metadata
    - FastAPI, Pydantic, Jinja2, Tailwind for API and server-rendered webpages
  - jsdom and mozilla-readability for article extraction
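
To make the shape of that pipeline concrete (chunk, embed, store, query), here is a dependency-free sketch. It deliberately swaps in toy stand-ins: a word-window chunker instead of SentenceSplitter, a hashed bag-of-words vector instead of BAAI/bge-small-en, and a brute-force cosine scan instead of ChromaDB. All function names are hypothetical, and none of this is the site's actual code.

```python
import math
import zlib

DIM = 64  # toy dimensionality, chosen arbitrarily for this sketch


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word windows (requires size > overlap)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def toy_embed(text):
    """Hashed bag-of-words vector, L2-normalized. NOT a semantic embedding."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(docs):
    """docs: {doc_id: text} -> list of (doc_id, chunk, vector) entries."""
    return [
        (doc_id, chunk, toy_embed(chunk))
        for doc_id, text in docs.items()
        for chunk in chunk_words(text)
    ]


def search(index, query, k=3):
    """Return the k best (score, doc_id, chunk) tuples by cosine similarity."""
    q = toy_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), doc_id, chunk)
        for doc_id, chunk, vec in index
    ]
    return sorted(scored, reverse=True)[:k]
```

With a real embedding model the vectors carry meaning rather than token overlap, and a vector DB replaces the linear scan, but the flow of data stays the same.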

I generated the index locally on my M2 Mac, which ripped through the ~70k articles in ~12 hours to generate all the embeddings.

I run the search site with Podman on a VM from Hetzner—along with other projects—for ~$8 / month. All requests are handled on CPU w/o calls to external AI providers. Query times are <200 ms, which includes embedding generation → vector DB lookup → metadata retrieval → page rendering. The server source code is here [3].
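
The last two steps of that request path (metadata retrieval, then page rendering) can be sketched with the standard library alone. This assumes a hypothetical `articles(id, title, url)` table and takes the ranked ids from the vector lookup as given; the real server uses Jinja2 for rendering, which is replaced here by a plain f-string.

```python
import sqlite3
import time
from html import escape


def fetch_metadata(conn, ids):
    """Look up (id, title, url) rows for the ranked ids, preserving rank order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title, url FROM articles WHERE id IN ({placeholders})",
        list(ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # SQL does not guarantee IN-list order, so re-sort by the vector ranking.
    return [by_id[i] for i in ids if i in by_id]


def render_results(rows):
    """Render rows as an ordered HTML list, escaping titles and URLs."""
    items = "".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        for _id, title, url in rows
    )
    return f"<ol>{items}</ol>"


def handle_query(conn, ranked_ids):
    """Return the rendered HTML plus elapsed milliseconds for these two steps."""
    start = time.perf_counter()
    html = render_results(fetch_metadata(conn, ranked_ids))
    elapsed_ms = (time.perf_counter() - start) * 1000
    return html, elapsed_ms
```

Timing each stage separately like this is one way to see where a latency budget such as 200 ms is actually spent.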

Nice work @jnnnthnn! What you built is fast, the rankings were solid, and the summaries are convenient.

[1] https://majestic.com/reports/majestic-million

[2] https://github.com/awendland/hacker-news-small-sites/blob/ge...

[3] https://github.com/awendland/hacker-news-small-sites-website...

2 comments


jasonjmcghee 1 year ago

I would strongly consider incorporating a hybrid search strategy (keyword or BM25, etc.), as a number of the searches I tried currently return rather surprising results.
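
One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion, sketched below. The choice of RRF and the constant `k=60` are assumptions for illustration; the comment above does not name a specific fusion method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by doc id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["a", "b", "c"]  # e.g. from a BM25 index
vector_hits = ["b", "d", "a"]   # e.g. from the embedding search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because RRF only looks at ranks, it avoids having to put BM25 scores and cosine similarities on a common scale.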

jnnnthnn 1 year ago

Fun! Thanks for sharing! It's super fast and the bias toward smaller websites really does surface interesting things. Very reminiscent of the Kagi small web site, also!