Comment by pietz

4 months ago

I enjoyed reading this, but it seems overly complex and at least slightly flawed.

Why not embed all tweets, cluster them with an algorithm of your choice and have an LLM provide names for each cluster?

Cheaper, better clusters and more accurate labels.
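The suggested pipeline is short enough to sketch end-to-end. Everything below is illustrative: `embed_tweets` fakes deterministic vectors (in practice you'd call an embedding model like SBERT), `name_cluster` stands in for an LLM call, and the k-means is a minimal toy implementation.

```python
import numpy as np

def embed_tweets(tweets):
    # Placeholder: in practice, call an embedding model (e.g. SBERT).
    # Fake deterministic vectors keep the sketch self-contained.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tweets), 8))

def kmeans(X, k, iters=20, seed=0):
    # Minimal k-means: random init, then alternate assignment/update steps.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def name_cluster(samples):
    # Placeholder for an LLM prompt such as:
    # "Here are example tweets from one cluster; give it a short name."
    return f"cluster of {len(samples)} tweets"

tweets = [f"tweet {i}" for i in range(40)]
X = embed_tweets(tweets)
labels, _ = kmeans(X, k=4)
names = {j: name_cluster([t for t, l in zip(tweets, labels) if l == j])
         for j in range(4)}
```

In production you'd swap the stubs for a real embedding model and LLM; the structure (embed, cluster, label once per cluster) is what makes it cheap relative to classifying every tweet individually.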

OP here. I agree! I should've called out why I did _not_ follow that approach as many others have commented the same.

The main reason is that I needed the classification to be ongoing. My system pulled in thousands of tweets per day, and they all needed to be classified as they arrived for some downstream tasks.

Thus, I couldn't embed all tweets, then cluster, then ...

  • Do the labels need to be static once the system has started? If not, it would be interesting to relabel embedding clusters once each hits a certain critical mass of tweets, or to do so somewhat continuously.
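That hybrid also addresses the "ongoing" constraint: cluster once on an initial batch, then route new tweets to the nearest centroid, relabeling a cluster when it hits some critical mass. A sketch with made-up names (`assign`, `on_new_tweet`, `RELABEL_AT`) and toy 2-D centroids standing in for real embedding centroids:

```python
import numpy as np

# Assume these came from a one-off clustering of an initial batch.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
cluster_sizes = [0, 0]
RELABEL_AT = 100  # hypothetical "critical mass" before re-asking the LLM for a name

def assign(vec):
    # Route each incoming tweet embedding to its nearest centroid.
    dists = np.linalg.norm(centroids - vec, axis=1)
    return int(dists.argmin())

def on_new_tweet(vec):
    c = assign(vec)
    cluster_sizes[c] += 1
    if cluster_sizes[c] == RELABEL_AT:
        pass  # e.g. sample recent tweets from cluster c, ask the LLM for a fresh label
    return c
```

Classification of each incoming tweet is then a single nearest-neighbor lookup, which is cheap enough to run on every tweet as it arrives.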

k-Means clustering works very well on embeddings from models such as SBERT: if you feed in 20,000 documents and ask for k=20 clusters, the clusters are pretty good -- with the caveat that the algorithm wants to make roughly equal-sized clusters, so if 5% of your articles are about Fútbol you will probably get one cluster of Fútbol, but if 20% of them are about the carbon cycle you will get four clusters of carbon cycle.

There are other clustering algorithms that try to fit variable-size clusters or hierarchically organized clusters, which may or may not produce better clusters but generally take more resources than k-Means; at 20,000 documents k-Means is just getting started while the others might already be struggling.
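For comparison, one variable-size alternative is hierarchical clustering, where you cut the dendrogram at a distance threshold instead of fixing k, so cluster sizes fall out of the data. A sketch using SciPy, with toy 2-D points standing in for embeddings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of deliberately unequal size.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(30, 2)),  # large tight cluster
    rng.normal(loc=5.0, scale=0.1, size=(10, 2)),  # small cluster
])

# Build an average-linkage tree, then cut at a distance threshold.
# Unlike k-means, nothing pushes the clusters toward equal sizes.
Z = linkage(X, method="average")
labels = fcluster(Z, t=1.0, criterion="distance")
sizes = sorted(np.bincount(labels)[1:].tolist(), reverse=True)
```

The resource caveat stands: linkage is O(n²) in memory, so what runs instantly here gets painful well before the scale where k-means is still comfortable.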

Having the LLM write a title for the clusters is something you can do uniquely with big LLMs and prompt engineering.
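The naming step itself is mostly prompt construction. A sketch of what that might look like; the prompt wording and the `ask_llm` stub are assumptions, not any particular API:

```python
def build_naming_prompt(samples, max_samples=10):
    # Show the LLM a handful of representative tweets from one cluster
    # and ask for a short human-readable label.
    shown = samples[:max_samples]
    lines = "\n".join(f"- {s}" for s in shown)
    return (
        "The following tweets all belong to one cluster:\n"
        f"{lines}\n"
        "Reply with a short (2-4 word) name for this cluster."
    )

def ask_llm(prompt):
    # Stub: replace with a call to your LLM of choice.
    return "placeholder label"

prompt = build_naming_prompt(["goal in extra time", "transfer window rumor"])
```

Sampling the documents closest to the cluster centroid (rather than at random) tends to give the LLM more representative examples to name.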

It's wrong to say "don't waste your time collecting the data to train and evaluate a model, because you can always prompt a commercial LLM and it will be 'good enough'": you at the very least need the evaluation data to prove that your system is 'good enough' and to decide whether one option is better than another (say, swapping Gemini vs. Llama vs. Claude).
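Concretely, even a small hand-labeled evaluation set lets you compare candidate systems head-to-head. The tweets, labels, and model predictions below are fabricated for illustration:

```python
# Hand-labeled evaluation set: (tweet, true category).
eval_set = [
    ("match highlights", "sports"),
    ("co2 levels rising", "climate"),
    ("striker scores twice", "sports"),
    ("ocean carbon sink study", "climate"),
]

# Hypothetical predictions from two candidate classifiers (LLM A vs. LLM B).
preds_a = ["sports", "climate", "sports", "sports"]
preds_b = ["sports", "climate", "sports", "climate"]

def accuracy(preds):
    truths = [label for _, label in eval_set]
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

# Pick whichever candidate scores higher on the shared eval set.
best = max([("A", preds_a), ("B", preds_b)], key=lambda kv: accuracy(kv[1]))[0]
```

Without the labeled set there is no principled way to make that swap-out decision at all.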

In the end, though, you might wish that the classification were not something arbitrary that the system slapped on, but rather a "class" in some ontology which has certain attributes (e.g. a book can have a title, and a "heavy book" weighs more than 2 pounds by my definition). If you're going the formal-ontology route you need the same evaluation data so you know you're not doing it wrong. If you've accepted that, though, you might as well collect more data and train a supervised model, and what I see in the literature is that the many-shot approach still outperforms one-shot and few-shot.
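Once you've collected that labeled data anyway, a supervised baseline is cheap to stand up. A nearest-centroid sketch over placeholder embeddings (the class names and 2-D vectors are made up; real inputs would be embedding vectors):

```python
import numpy as np

# Many labeled examples per class (embeddings faked as 2-D points here).
train = {
    "sports":  np.array([[1.0, 0.1], [0.9, 0.0], [1.1, -0.1]]),
    "climate": np.array([[-1.0, 0.2], [-0.9, 0.1], [-1.1, 0.0]]),
}

# "Training" is just averaging each class's embeddings into a centroid.
centroids = {cls: vecs.mean(axis=0) for cls, vecs in train.items()}

def classify(vec):
    # Predict the class whose centroid is nearest to the input embedding.
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - vec))
```

More labeled examples per class tighten the centroids, which is one intuition for why many-shot keeps beating few-shot as the labeled set grows.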

[1] which is on the scale of the training data in most applications

From my limited experience trying exactly this, it gets you 80% of the way there, then devolves into an infuriating and time-wasting exercise in endless iteration and prompting to sweep clustering parameters and labeling details to nail the remaining 20% needed for acceptance by downstream "customers" (i.e., nontechnical business people).

If your end goal is to show an audience of nontechnical stakeholders an overview of your dataset in a static medium (like a slide), I would suggest you do the cluster labeling yourself, with the help of interactive tooling to make the semantic cluster structure explorable.

One option is to throw the dataset into Apple's recently published and open-sourced Embedding Atlas (https://github.com/apple/embedding-atlas), take a screenshot of the cluster viz, poke around in the semantic space, and manually annotate the top 5-10 most interesting clusters right in Google Slides or PowerPoint. If you need more control over the embedding and projection steps (and you have a bit more time), write your own embedding and projection, then use something like Plotly to build a quick interactive viz just for yourself; drop a screenshot into a slide and annotate it.

Feels super dumb, but is guaranteed to produce human-friendly output you can actually present confidently as part of your data story, so you can get on with your life.
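If you do roll your own, the projection step can be as simple as PCA via SVD before handing the 2-D points to Plotly or matplotlib. A sketch, with random vectors standing in for real embeddings:

```python
import numpy as np

def pca_2d(X):
    # Center the data, then project onto the top-2 right singular vectors.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))  # e.g. SBERT-sized vectors
points = pca_2d(embeddings)               # shape (100, 2), ready to scatter-plot
```

In practice UMAP or t-SNE usually give prettier cluster separation than raw PCA, but PCA is deterministic, dependency-free, and plenty for a quick slide.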