Comment by PaulHoule

4 months ago

k-Means clustering works very well on embeddings from models such as SBERT: if you feed in 20,000 documents[1] and ask for k=20 clusters, the clusters are pretty good -- with the caveat that k-Means wants to make roughly equal-sized clusters, so if 5% of your articles are about fútbol you will probably get one fútbol cluster, but if 20% of them are about the carbon cycle you will get four clusters of carbon cycle.
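
A minimal sketch of that pipeline (assuming sentence-transformers and scikit-learn are installed; "all-MiniLM-L6-v2" is just a stand-in SBERT model):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = ["first document text", "second document text"]  # your ~20,000 docs

# Embed every document; normalized embeddings make Euclidean k-means
# behave roughly like cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, normalize_embeddings=True)

# Ask for k=20 clusters; k-means will tend toward roughly equal sizes.
kmeans = KMeans(n_clusters=20, random_state=0)
labels = kmeans.fit_predict(embeddings)
```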

There are other clustering algorithms that try to fit variable-sized clusters or hierarchically organized clusters, which may or may not produce better clusters but generally take more resources than k-Means; k-Means is just getting started at 20,000 documents, while others might already be struggling at that point. (A sketch of one such algorithm follows.)
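
Agglomerative clustering is one concrete example of the hierarchical kind (my choice of example, not one named above): with a distance threshold instead of a fixed k it yields variable-sized clusters, but it is roughly O(n^2), which is why it struggles where k-Means is just warming up. A sketch, assuming scikit-learn >= 1.2 and the embeddings from the snippet above:

```python
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(
    n_clusters=None,          # let the threshold decide how many clusters
    distance_threshold=0.4,   # tune for your embedding space
    metric="cosine",
    linkage="average",
)
labels = agg.fit_predict(embeddings)  # cluster count varies with the data
```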

Having an LLM write a title for each cluster is something you can do uniquely with big LLMs and prompt engineering.
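
A hedged sketch of that titling step, assuming an OpenAI-style chat client (the model name is a placeholder; any chat-capable model works the same way):

```python
from openai import OpenAI

client = OpenAI()

def title_cluster(cluster_docs, n_samples=10):
    # Show the model a sample of documents from one cluster and ask
    # for a short descriptive title.
    sample = "\n---\n".join(cluster_docs[:n_samples])
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in whatever you use
        messages=[{
            "role": "user",
            "content": "Give a short descriptive title (under 8 words) "
                       "for this group of documents:\n\n" + sample,
        }],
    )
    return response.choices[0].message.content.strip()
```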

It's wrong to say "don't waste your time collecting the data to train and evaluate a model because you can always prompt a commercial LLM and it will be 'good enough'": at the very least you need the evaluation data to prove that your system is 'good enough' and to decide whether one system is better than another (say, swapping out Gemini vs. Llama vs. Claude).
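
With a hand-labeled eval set, that comparison is a few lines; a toy sketch with illustrative labels (the gold and predicted labels here are made up for demonstration):

```python
from sklearn.metrics import f1_score

# Hand-labeled gold data and each model's predictions on the same set.
gold        = ["sports", "climate", "sports", "climate", "sports"]
pred_gemini = ["sports", "climate", "climate", "climate", "sports"]
pred_claude = ["sports", "sports",  "sports",  "climate", "sports"]

for name, pred in [("gemini", pred_gemini), ("claude", pred_claude)]:
    print(name, f1_score(gold, pred, average="macro"))
```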

In the end, though, you might wish that the classification were not something arbitrary that the system slapped on, but rather a "class" in some ontology which has certain attributes (e.g. a book can have a title, and a "heavy book" weighs more than 2 pounds by my definition). If you are going the formal ontology route you need the same evaluation data so you know you're not doing it wrong. If you've accepted that, though, you might as well collect more data and train a supervised model, and what I see in the literature is that the many-shot approach still outperforms one-shot and few-shot.
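
A sketch of that supervised route, assuming you already have SBERT embeddings and hand-assigned gold labels for the same documents (logistic regression is my stand-in here for "a supervised model"):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# embeddings: array of SBERT vectors; gold_labels: hand-labeled classes.
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, gold_labels, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```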

[1] which is on the scale of the training data in most applications