Comment by jawns
4 months ago
If you already have your categories defined, you might even be able to skip a step and just compare embeddings.
I wrote a categorization script that sorts customer-service calls into one of 10 categories. I wrote a description of each category, then turned each description into an embedding.
Then I created embeddings for the call notes and matched each one to the closest category using cosine similarity.
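Something like this, for illustration; the model and the category descriptions are stand-ins, not my actual setup:

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # One short description per category (10 in the real script).
    categories = {
        "billing": "Questions about invoices, charges, and refunds.",
        "outage": "Reports that the service is down or unreachable.",
        "account": "Password resets, login problems, profile changes.",
    }
    cat_names = list(categories)
    cat_vecs = model.encode(list(categories.values()))  # embed each description once

    def classify(call_note: str) -> str:
        """Return the category whose description embedding is closest to the note."""
        note_vec = model.encode([call_note])
        sims = cosine_similarity(note_vec, cat_vecs)[0]
        return cat_names[sims.argmax()]

    print(classify("Customer says they were charged twice this month."))  # most likely "billing"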
I originally settled on doing this, but the problem is that you have to re-calculate everything if you ever add/remove a category. If your categories will always be static, that will work fine. But it's more than likely you'll eventually have to add another category down the line.
If your categories are dynamic, the way OP handles it will be much cheaper as the number of tweets (or customer-service calls in your case) grows, as long as the cache hit rate is >0%. Each tweet gets its own label, e.g. "joke_about_bad_technology_choices". Each of these labels is then put into a category, e.g. "tech_jokes". If you add/remove a category you still have to re-calculate, but only the label-to-category mapping rather than every single tweet. Since similar tweets can share the same label, you end up with fewer labels than tweets, and as you approach the asymptotic ceiling mentioned in OP's post, the cost of re-embedding labels against categories flattens out as well.
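Roughly like this; the names, labels, and embedding model below are illustrative assumptions, not OP's actual code:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Cache filled once per tweet (e.g. by an LLM); never recomputed when categories change.
    tweet_to_label = {
        "Rewriting the app in our 5th JS framework this year, wish us luck":
            "joke_about_bad_technology_choices",
    }

    categories = ["tech_jokes", "programming_opinions", "product_announcements"]

    def assign_categories(labels: list[str], categories: list[str]) -> dict[str, str]:
        """Only this step reruns when categories change: it embeds labels, not tweets."""
        label_vecs = model.encode(labels, normalize_embeddings=True)
        cat_vecs = model.encode(categories, normalize_embeddings=True)
        best = (label_vecs @ cat_vecs.T).argmax(axis=1)  # cosine similarity via dot product
        return {lbl: categories[i] for lbl, i in zip(labels, best)}

    label_to_category = assign_categories(sorted(set(tweet_to_label.values())), categories)

    def category_of(tweet: str) -> str:
        return label_to_category[tweet_to_label[tweet]]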
If the number of items you're categorizing is a couple thousand at most and you rarely add/remove categories, it's probably not worth the complexity. But in my case (and OP's) it's worth it because the number of items keeps growing.
I had this same idea in mid-2024, but embeddings and cosine similarity are way less consistent; not even the classic king − man + woman ≈ queen analogy works reliably. The latest embedding models from OpenAI are from 2023 or so. Did you actually try this? Which embedding models work for this?
How did you construct the embedding? Sum of individual token vectors, or something more sophisticated?
sentence embedding models like all-MiniLM-L6-v2 [1], bge-m3 [2]
[1] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...
[2] https://huggingface.co/BAAI/bge-m3
In a recent project I used OpenAI's embedding model for this because of its convenient API and low cost.
Modern embedding models (particularly those with context windows of 2048+ tokens) let you YOLO it and plop the entire text blob in, and you still get meaningful vectors.
Formatting the input text to have a consistent schema is optional but recommended to get better comparisons between vectors.
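For example, something along these lines; the template fields and the model name are illustrative choices, not prescriptions:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def format_note(ticket: dict) -> str:
        # Consistent schema so every input has the same shape before embedding.
        return (
            f"Subject: {ticket.get('subject', '')}\n"
            f"Channel: {ticket.get('channel', '')}\n"
            f"Notes: {ticket.get('notes', '')}"
        )

    def embed(texts: list[str]) -> list[list[float]]:
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return [d.embedding for d in resp.data]

    vectors = embed([format_note({
        "subject": "Double charge",
        "channel": "phone",
        "notes": "Customer billed twice in March.",
    })])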
sentence embedding models are great for this type of thing.
Out of curiosity, what embedding model did you use for this?
This works in a pinch but is much less reliable than using a curated set of representative examples from each targeted class.
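One simple way to use curated examples is to average each class's example embeddings into a centroid and classify by nearest centroid; the examples and model here are stand-ins, just to show the shape of it:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # A few hand-picked, representative examples per class.
    examples = {
        "billing": ["I was charged twice", "Where is my refund?"],
        "outage": ["The site is down", "I can't reach the API"],
    }

    def centroid(texts: list[str]) -> np.ndarray:
        vecs = model.encode(texts, normalize_embeddings=True)
        return vecs.mean(axis=0)

    names = list(examples)
    centroids = np.stack([centroid(t) for t in examples.values()])

    def classify(text: str) -> str:
        v = model.encode([text], normalize_embeddings=True)[0]
        return names[int((centroids @ v).argmax())]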
That was my first thought: why even generate tags? Curious to see if anyone's shown empirically that it's worse, though.
In a recent project I was asked to create a user story classifier to identify whether stories were "new development" or "maintenance of existing features". I tried both approaches, embeddings + cosine distance vs. directly asking a language model to classify the user story. The embeddings approach was, despite being fueled by the most powerful SOTA embedding model available, surprisingly worse than simply asking GPT 4.1 to give me the correct label.
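The "just ask the model" baseline is roughly this; the prompt and model name are a reconstruction, not my exact setup:

    from openai import OpenAI

    client = OpenAI()

    def classify_story(story: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system",
                 "content": "Classify the user story as exactly one of: "
                            "'new development' or 'maintenance of existing features'. "
                            "Reply with the label only."},
                {"role": "user", "content": story},
            ],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()

    print(classify_story("As a user, I want the existing export button to also support CSV."))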
OP here. It depends on what you use it for. You do want the tags if you intend to generate data. Let's say you prompt an LLM to tweet on your behalf for a week, with the ability to:
- Fetch a list of my unique tags to get a sense of my topics of interests
- Have the AI dig into those specific niches to see what people have been discussing lately
- Craft a few random tweets that are topic-relevant and present them to me to curate
That's a very powerful workflow, and it's hard to deliver on without the class labels.
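A loose sketch of what that looks like; every function name and prompt here is hypothetical, not my actual code:

    from openai import OpenAI

    client = OpenAI()

    def unique_tags(tweet_records: list[dict]) -> list[str]:
        # Step 1: the stored labels give a cheap summary of my topics of interest.
        return sorted({r["label"] for r in tweet_records})

    def draft_tweets(tags: list[str], recent_discussion: str, n: int = 3) -> str:
        # Steps 2-3: feed the niches plus whatever the AI dug up back into the model
        # and get candidate tweets to curate by hand.
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{
                "role": "user",
                "content": f"My interests (as tags): {', '.join(tags)}\n"
                           f"Recent discussion in these niches:\n{recent_discussion}\n"
                           f"Draft {n} short, topic-relevant tweets for me to review.",
            }],
        )
        return resp.choices[0].message.content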