Show HN: My recommendation engine for Hacker News
2 years ago (hn-recommend.julienc.me)
Hi! I’m Julien and I built a recommendation engine for Hacker News.
I feel like this website is a gold mine. Every day, I find some very interesting stories about a topic. And sometimes, I want to find other stories covering that same topic but I can’t.
Hacker News has years of history of awesome discussion and resources. Unfortunately, I think HN Algolia isn’t helpful in searching these old threads. As a student, I want to learn a lot from this website.
This is why I created HN Recommend. Input a sentence or the URL of an article, and get the most popular and similar posts from Hacker News.
As for the technical details, I computed the embeddings of over 100,000 articles from HN and indexed them using Faiss. I wrote a blog post with a deeper explanation.
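To make the indexing step concrete, here is a minimal pure-Python sketch of the exact nearest-neighbour search that a flat L2 index performs: compare the query vector with every stored vector and keep the k smallest squared Euclidean distances. It is not the project's code and does not call Faiss. The tiny 3-dimensional vectors, the titles, and the function names are made up for illustration.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, vectors, k=2):
    """Return the k (distance, index) pairs closest to the query."""
    scored = ((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

# Toy "embeddings" (hypothetical; real ones have far more dimensions).
titles = ["Memory allocation", "Garbage collectors", "Sourdough recipes"]
vectors = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]

results = nearest([1.0, 0.0, 0.0], vectors, k=2)
for dist, idx in results:
    print(f"{titles[idx]}: {dist:.2f}")
```

A brute-force scan like this is linear in the number of stored vectors, which is why a dedicated index library becomes attractive once the corpus grows.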
Source code: https://github.com/julien040/hn-recommendation-api
Article: https://julienc.me/articles/Extract_embeddings_Hacker_News_a...
Project: https://hn-recommend.julienc.me
Aww, thank you for using my Memory Allocation post as the placeholder text. <3
I often wish I could sort Hacker News into two categories: actual software/tech/STEM and everything else. I think both are interesting, but often, the niche tech stuff gets drowned out fast. So this is great for that :-)
I just released a new update, thanks to everyone's feedback. Now, you can sort results by relevancy, age, or score using the select.
This is a joy to use, and it also fits very nicely with the other highly ranked post by Nielsen group. Kudos!
Which other post? There is so much churn on HN that it’s hard to know which post you are referring to
Likely https://news.ycombinator.com/item?id=36394569
This is great. I often come across some HN post on a topic I am interested in and then want to go look at other posts in the same topic cluster to expand my exposure. This looks awesome for that.
I don't know if it would be useful or even work, but is it possible to let the user adjust the vector distance threshold and then apply the other sorting parameters to the results? E.g., if I want to go broader, but then sort by high score or something, so I see popular posts within an expanded (but still relevant) cluster?
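A rough sketch of that idea, assuming each result carries a distance and a score: first keep everything within a user-chosen distance threshold, then sort what is left by popularity. The field names (`distance`, `score`, `title`) and the sample values are hypothetical, not the actual API response.

```python
def broaden_then_sort(results, max_distance, key="score"):
    """Keep results within max_distance, then sort by `key`, highest first."""
    kept = [r for r in results if r["distance"] <= max_distance]
    return sorted(kept, key=lambda r: r[key], reverse=True)

# Hypothetical search results.
results = [
    {"title": "A", "distance": 0.10, "score": 12},
    {"title": "B", "distance": 0.25, "score": 340},
    {"title": "C", "distance": 0.40, "score": 95},
    {"title": "D", "distance": 0.90, "score": 800},
]

# A tight threshold keeps only the closest match; a looser one
# admits more posts and lets popularity decide the order.
narrow = [r["title"] for r in broaden_then_sort(results, 0.2)]
broad = [r["title"] for r in broaden_then_sort(results, 0.5)]
```

Raising the threshold trades relevance for breadth, so the very popular but distant post "D" only appears once the threshold passes 0.9.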
Check out https://askhn.ai
The content is ranked by how people discuss the topics and who discusses them
If you just do embeddings on posts, you might miss relevant content. When people who have knowledge of AMD discuss Intel and believe that content is relevant to AMD, the content will be ranked.
I thought about an algorithm with weights adjustable by the user. The API now returns a field with the distance between the post and the query (the square of the Euclidean distance). The interface uses it to rank results by relevance.
Perhaps I can compute a score for each story, where each field has a weight, and rank the results using this score. For example, the score could be 0.2 x score + 0.1 x comments + 1/distance - timestamp/10^9. The stories with the highest score would be shown first, and the weights (0.2, 0.1, 10^9) could be adjusted by the user, as some might prefer recency while others prefer popularity.
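The formula above could be sketched like this. The field names, the sample stories, and the small `eps` guard against a zero distance are assumptions added for illustration, and the timestamp is assumed to be in Unix seconds. Note that subtracting the timestamp, as written, slightly lowers the score of newer stories, so someone who prefers recency would probably flip that sign or use the story's age instead.

```python
def rank_score(story, w_score=0.2, w_comments=0.1, time_scale=1e9, eps=1e-9):
    """Combine popularity, similarity, and time into one number."""
    return (
        w_score * story["score"]
        + w_comments * story["comments"]
        + 1.0 / (story["distance"] + eps)  # eps avoids division by zero
        - story["timestamp"] / time_scale  # sign kept as in the formula
    )

# Hypothetical stories.
stories = [
    {"title": "b", "score": 10, "comments": 5,
     "distance": 0.25, "timestamp": 1_680_000_000},
    {"title": "a", "score": 100, "comments": 50,
     "distance": 0.5, "timestamp": 1_600_000_000},
]

# Highest combined score first.
ranked = sorted(stories, key=rank_score, reverse=True)
print([s["title"] for s in ranked])
```

Exposing `w_score`, `w_comments`, and `time_scale` as parameters is one way to let each user tune the balance between popularity, similarity, and time.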
It might be useful to pose this problem in terms of a precision vs. recall curve.
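As a small illustration of that framing, the sketch below computes precision and recall at a few distance thresholds. It assumes a hand-labelled set of relevant post ids, which is hypothetical here, as are the field names and the sample distances.

```python
def precision_recall(results, relevant_ids, max_distance):
    """Precision and recall of the results kept at a distance threshold."""
    kept = {r["id"] for r in results if r["distance"] <= max_distance}
    hits = len(kept & relevant_ids)
    # Convention: an empty result set counts as perfectly precise.
    precision = hits / len(kept) if kept else 1.0
    recall = hits / len(relevant_ids) if relevant_ids else 1.0
    return precision, recall

# Hypothetical results and relevance labels.
results = [
    {"id": 1, "distance": 0.1},
    {"id": 2, "distance": 0.2},
    {"id": 3, "distance": 0.3},
    {"id": 4, "distance": 0.6},
]
relevant = {1, 3, 4}

# One (threshold, precision, recall) point per threshold.
curve = [(t, *precision_recall(results, relevant, t)) for t in (0.15, 0.35, 0.7)]
for t, p, r in curve:
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Loosening the threshold raises recall and usually costs precision, which is the trade-off a user-adjustable threshold would expose.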
Hmm I tried searching "elixir" and found nothing related to the language. HN Algolia gives me exactly what I want. On what basis do you say it's "not helpful"?
Yes, the search doesn't work very well for one word. Try entering a URL about Elixir, like this: https://hn-recommend.julienc.me/?q=https%3A%2F%2Fnews.ycombi...
I may have used the incorrect term. HN Algolia is effective for searching for a particular story. However, I am unable to utilize it to find related posts on the same topic that do not contain the same words.
Out of curiosity related to the word vectorization algorithm... why does one word not perform as well? What's the cause/rationale?
hey Julien. I love the product but the search doesn't seem to be doing the best for me. For example, I looked up Tailwind and got plenty of results but none of them actually involved Tailwind.
Maybe a tagging solution is the way? If you determine a set number of popular keywords for a topic and filter around those, you can offer more relevant results. With some sort of public tagging system, you can also have SEO-friendly pages around tags and get people browsing stuff they wouldn't normally search for.
At first, the website concept focused on getting posts similar to a URL. Querying with text didn't yield relevant results.
Your solution appears better suited for this use case. Thank you.
What I really need for HN (and any other news feed for that matter) is something like "google discover" i.e. a content-based recommendation system with some sort of feedback mechanism.
So I would get information relevant to me (I can skip, visit, like, dislike) whether or not it's popular. That last point is important because the HN home page doesn't give you that, and most posts could get lost in oblivion just because the first few folks did not find them interesting.
HN needs a simple feature: a weekly digest view that shows the top 30 most commented posts (it should completely ignore flags and votes).
You mean like the one that's emailed to me every week?
https://hackernewsletter.com
Thanks, I was considering something like this as I used IFTTT to send me weekly top threads from certain subreddits, but now with Reddit going south…
Please sort by recency. Otherwise you see 13-year-old articles, most of them obsolete or irrelevant to the current situation.
I was worried that sorting by recency would give less relevant results. Perhaps I should add a threshold to exclude posts that are too old.
You can now sort by recency. I hope this helps.
Love it.
This response is very reactive heavy, whereas it’s Elixir I’m more interested in.
But well done on the execution. It does exactly what it states.
I’ve bookmarked.
I often search HN for additional articles and discussions based on something I’ve just read. Next time I’ll use this tool.
Great project. I learned about the Faiss library. Out of curiosity, did you also try it with doc2vec?
I didn't try Doc2Vec. I wanted a hosted solution because I wouldn't have been able to compute all this locally (more than 100,000 posts).
If you tried it, did you get good results with it? I may use it in future projects.
Yes, I am using it on a not so small dataset (roughly 1 million docs) and the output is a fairly efficient model. I am using gensim with pre-trained word vectors. New docs can be inferred via .infer_vector().
Overall my approach is less automated than what I have seen in your codebase so it’s likely a bigger investment. I am happy to share more.
The blog post linked on GitHub was a nice walkthrough of your method, and I was interested in what you think the hit rate was for getting usable text for embeddings from TFA links. 100K is a good-sized corpus, but I'm wondering how many got skipped due to paywalls, 404 links, or other problems?
A comment about search results: "design system" is related to design, "system design" relates to computing
It seems search takes the two inputs as the same.
Also, search doesn't seem to work when using just 1 word.
Yes it's an issue. Sadly, I can't fix it. I'm using the closed source "text-embedding-ada-002" model from OpenAI.
From what I can see, the longer the input, the more accurate the results. Perhaps you can try something longer, like "What is a design system for UI?"
Yes, adding context helps.
Thanks!
This is amazing, thank you for this. Makes finding stuff a lot easier
I like the idea of this but won't remember it because my muscle memory is tuned to news.ycombinator.com. Perhaps I can recommend a Chrome extension instead of a website?
Thank you for suggesting this.
The API is already made and can be found at https://github.com/julien040/hn-recommendation-api. I don't think it would be too difficult to build a Chrome extension that fetches it.
An iOS share widget would be cool too. Since you support putting the input text in the URL, then maybe someone can make a Workflow for it and share it here.
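For anyone wiring up such a shortcut or extension, a minimal sketch of building the search link follows. It relies only on the `?q=` parameter visible in the links shared in this thread; the function name is made up, and nothing here calls the API itself.

```python
from urllib.parse import quote

BASE_URL = "https://hn-recommend.julienc.me/"

def search_url(query):
    """Build a search link by percent-encoding the text or URL into ?q=."""
    # safe="" also encodes "/" so a pasted URL survives as one parameter.
    return f"{BASE_URL}?q={quote(query, safe='')}"

print(search_url("paul graham"))
print(search_url("https://news.ycombinator.com/item?id=36394569"))
```

Encoding the whole input, including slashes and question marks, keeps a pasted article URL from being misread as part of the site's own address.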
This URL fails
https://hn-recommend.julienc.me/?q=Go
Oops, on the API side, there is a check to ensure the text is long enough (5 characters), but I forgot to add this check client-side. Thank you for pointing out the issue.
Try this https://hn-recommend.julienc.me/?q=Golang if you want stories related to Go.
Edit: add link
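The missing client-side check could look roughly like the sketch below. It is written in Python for illustration only (the real client would be browser code); treating the 5-character bound as inclusive and trimming whitespace first are both assumptions.

```python
MIN_QUERY_LENGTH = 5  # from the comment above; inclusive bound is assumed

def is_query_long_enough(text):
    """Return True if the trimmed query meets the minimum length."""
    return len(text.strip()) >= MIN_QUERY_LENGTH

for query in ("Go", "Golang"):
    status = "ok" if is_query_long_enough(query) else "too short"
    print(f"{query!r}: {status}")
```

Running the same check before sending the request lets the interface show a hint instead of a failed page.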
I didn't expect embeddings to have such a simple yet useful application, thanks!
One feature I would like a recommender system to have is the explicit ability to jump in and out of filter bubbles or research rabbit holes. Another example would be putting yourself in the shoes of another, e.g. what content is generally liked by game developers. Apart from general gamedev content, what do they like, where do they take inspiration from, etc.
I remember there was a project built on Instagram which allowed a person to view Instagram as it looked to a particular celebrity.
I'm a bit divided on this feature. On one hand, I would like to have it; it would be awesome to see the recommendations of people from different jobs. On the other hand, I'm a bit concerned about privacy. The system must ensure that each group is big enough to avoid leaking any one person's recommendations. I don't want anyone to know exactly what I'm liking and what I'm watching.
If I recall correctly, myCANAL (the French Netflix) used to have a similar feature. You could access the recommendations of personalities of the channel, but it was curated manually.
I searched for a URL I know was posted and it doesn't show it. It shows unrelated articles.
The data is a few weeks old. Do you know when the URL was published?
It's 10 years old.
This search query https://hn-recommend.julienc.me/?q=paul%20graham returns articles that are missing both words of the query
Nit:
> Resources to learn about distributed systems
I thought Murat Buffalo's blog would come up at the top. That's gold, and I'm confident that it was shared on HN as well (maybe a year or two back).
Otherwise neat and useful!
The layout is currently buggy on Firefox.
Hi, are you talking about a problem like this one? https://cln.sh/MFG3DPZn+
Yeah, when there’s no thumbnail.
A time filter is needed