Comment by derbaum
12 hours ago
One of the things I'm still struggling with when using LLMs instead of classical NLP is classification against a large corpus of data. If I get a new text and want to find the most semantically similar text out of a million others, how would I do this with an LLM? Apart from choosing certain pre-defined categories (such as "friendly", "political", ...) and having the LLM rate each text on each category, I can't see a simple solution yet, except using embeddings (which I think could just be done with BERT and doesn't count as LLM usage?).
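The embedding route mentioned here is usually nearest-neighbor search over vectors. A minimal self-contained sketch, with a bag-of-words `embed` as a stand-in for a real embedding model (a BERT-style encoder would replace it):

```python
from collections import Counter
import math

def embed(text):
    # Stand-in for a real embedding model; a bag-of-words Counter
    # keeps this sketch runnable without any model downloads.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(query, corpus):
    # Return the index of the corpus text closest to the query.
    q = embed(query)
    return max(range(len(corpus)), key=lambda i: cosine(q, embed(corpus[i])))

corpus = ["the engine needs new oil",
          "parliament passed the bill",
          "what a friendly dog"]
print(most_similar("my car engine is making noise", corpus))  # → 0
```

At a million documents you would precompute all corpus vectors once and use an approximate-nearest-neighbor index rather than the linear scan shown here.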
I've used embeddings to define clusters, then passed sampled documents from each cluster to an LLM to create labels for each grouping. I had pretty impressive results from this approach when creating category/subcategory labels for a collection of texts I worked on recently.
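The labeling half of that pipeline can be sketched as follows, assuming clustering has already been done upstream. The `call_llm` function is hypothetical and stubbed out; only the prompt construction runs here:

```python
import random

def label_prompt(sample_docs):
    # Build the prompt that would be sent to an LLM to name one cluster.
    docs = "\n".join(f"- {d}" for d in sample_docs)
    return ("These documents belong to one cluster. "
            "Suggest a short category label:\n" + docs)

# Toy cluster assignments; in practice these come from k-means (or
# similar) over the embedding vectors.
clusters = {
    0: ["the engine needs oil", "my brakes squeal", "replace the spark plugs"],
    1: ["parliament passed the bill", "the senator gave a speech"],
}

for cid, docs in clusters.items():
    sample = random.sample(docs, min(2, len(docs)))
    prompt = label_prompt(sample)
    # label = call_llm(prompt)  # hypothetical LLM call
    print(cid, prompt.splitlines()[0])
```

Sampling a handful of documents per cluster keeps the prompt small regardless of corpus size, which is what makes this cheap even for millions of texts.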
That's interesting, it sounds a bit like those cluster graph visualisation techniques. Unfortunately, my texts seem to fall into clusters that really don't match the ones that I had hoped to get out of these methods. I guess it's just a matter of fine-tuning now.
Take two documents.
Feed one through an LLM, one word at a time, and keep track of words that experience greatly inflated probabilities of occurrence, compared to baseline English. "For" is probably going to maintain a level of likelihood close to baseline. "Engine" is not.
Do the same thing for the other one.
See how much overlap you get.
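The steps above can be sketched as code. The `lm_prob` function below is a stand-in stub with made-up probabilities; a real implementation would query an LLM for P(word | left context). The baseline frequencies are likewise toy values:

```python
# Toy baseline unigram frequencies for English; real values would come
# from a large reference corpus.
BASELINE = {"for": 0.05, "the": 0.06, "engine": 0.0001, "oil": 0.0002}

def lm_prob(context, word):
    # Stand-in for an LLM's next-word probability given the left context.
    toy = {"engine": 0.03, "oil": 0.02, "for": 0.05, "the": 0.06}
    return toy.get(word, 1e-6)

def inflated_words(text, ratio=10.0):
    # Words whose in-context probability is greatly inflated vs baseline.
    words = text.lower().split()
    out = set()
    for i, w in enumerate(words):
        if lm_prob(words[:i], w) / BASELINE.get(w, 1e-6) >= ratio:
            out.add(w)
    return out

def overlap(a, b):
    # Jaccard overlap of the two documents' inflated-word sets.
    sa, sb = inflated_words(a), inflated_words(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(overlap("the engine needs oil", "change the oil for the engine"))  # → 1.0
print(overlap("the engine needs oil", "parliament passed the bill"))     # → 0.0
```

Note that "for" and "the" stay near baseline and drop out, while "engine" and "oil" are flagged, matching the intuition above.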
Wouldn't a simple comparison of the word frequency in my text against a list of usual word frequencies do the trick here without an LLM? Sort of like BM25?
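For contrast, the frequency-only version of that idea is just observed counts against a baseline, with no model in the loop. Again the baseline values are toy numbers:

```python
from collections import Counter

# Toy baseline unigram frequencies; a real list would come from a large
# English corpus (this is the BM25 / tf-idf flavour of the idea).
BASELINE = {"the": 0.06, "for": 0.05, "engine": 0.0001, "oil": 0.0002}

def distinctive(text, ratio=10.0):
    # Words used far more often in this text than in baseline English.
    words = text.lower().split()
    freq = Counter(words)
    return {w for w in freq
            if (freq[w] / len(words)) / BASELINE.get(w, 1e-4) >= ratio}

print(distinctive("the engine needs oil"))  # flags rare words, not "the"
```

Unlike the LLM version, this can only flag words that literally appear in the text, which is exactly the limitation discussed in the reply below it would have.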
It might; it's not going to do the same thing. The LLM will tell you words that would likely appear in a similar text. Word frequency will tell you words that have actually appeared in your text. I'm postulating that the first kind of list is much more likely to show strong overlap between two similar documents than the second kind of list.
Vocabulary style matters a lot to what words are actually used, but much less to what words are likely to be used. If I'm following a style guide that says to use "automobile" instead of "car", appearance probabilities for "automobile" will be greatly inflated. And appearance probabilities for "car" will also be greatly inflated, just to a lesser extent than for "automobile". Whereas actual usage of "car" will be pegged at zero.
Determining how similar two texts are is something that an LLM should be good at. It should be better than a simple comparison of word frequency. Whether it's better enough to justify the extra compute is a different question.