Comment by dan_h

4 months ago

This is very similar to how I've approached classifying RSS articles by topic on my personal project[1]. However, to generate the embedding vector for each topic, I take the average vector of the top N articles tagged with that topic, sorted by similarity to the topic vector itself. Since I only consider topics created in the last few months, this helps the topic vectors adjust to semantic changes over time. It also helps with flagging topics that are "too similar" and merging them when their clusters sufficiently overlap.
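
For anyone who wants to try the idea, here is a minimal sketch in Python with NumPy. It is not the project's actual code: the function names, `top_n`, and the 0.9 threshold are all hypothetical, and the cosine check between topic vectors is only a simple stand-in for the cluster-overlap test described above.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length; zero vectors are left as zeros."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.where(norm == 0, 1.0, norm)

def refine_topic_vector(topic_vec, article_vecs, top_n=20):
    """Average the top_n tagged articles most similar to the current topic vector.

    topic_vec: current embedding of the topic, shape (d,)
    article_vecs: embeddings of articles tagged with the topic, shape (n, d)
    top_n: hypothetical cutoff, tune for your data
    """
    articles = normalize(article_vecs)
    topic = normalize(topic_vec)
    sims = articles @ topic                  # cosine similarity per article
    top = np.argsort(sims)[::-1][:top_n]     # indices of the most similar articles
    return normalize(articles[top].mean(axis=0))

def too_similar(topic_a, topic_b, threshold=0.9):
    """Flag two topics as merge candidates (assumed rule: cosine >= threshold)."""
    return float(normalize(topic_a) @ normalize(topic_b)) >= threshold
```

Re-running `refine_topic_vector` on a rolling window of recent articles is what would let a topic drift with its content; how the window and the merge rule are chosen is left open here.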

There's certainly more tweaking to be done, but I've been pretty happy with the results so far.