
Comment by thaumasiotes

9 months ago

Take two documents.

Feed one through an LLM, one word at a time, and keep track of the words whose predicted probability of occurrence is greatly inflated compared to baseline English. "For" is probably going to stay close to its baseline likelihood. "Engine" is not.

Do the same thing for the other one.

See how much overlap you get.
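A rough sketch of the steps above, assuming the per-word probabilities have already been pulled out of the model somehow. Nothing here calls a real LLM: `inflated_words`, `jaccard`, the 10x threshold, and all the numbers are made up for illustration, and Jaccard is just one way to measure "overlap".

```python
from typing import Dict, Set


def inflated_words(model_probs: Dict[str, float],
                   baseline_probs: Dict[str, float],
                   min_ratio: float = 10.0,
                   floor: float = 1e-9) -> Set[str]:
    """Words whose model probability is at least min_ratio times baseline.

    model_probs is assumed to hold, per word, the probability the LLM
    assigned to it while reading the document (e.g. averaged over positions).
    Words missing from the baseline are treated as extremely rare (floor),
    which is an arbitrary choice.
    """
    result = set()
    for word, p in model_probs.items():
        base = max(baseline_probs.get(word, 0.0), floor)
        if p / base >= min_ratio:
            result.add(word)
    return result


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy numbers, invented for this example only.
baseline = {"for": 0.009, "engine": 0.00002, "piston": 0.000005, "recipe": 0.00002}
doc_a = {"for": 0.010, "engine": 0.004, "piston": 0.001}
doc_b = {"for": 0.008, "engine": 0.003, "recipe": 0.00003}

set_a = inflated_words(doc_a, baseline)
set_b = inflated_words(doc_b, baseline)
overlap = jaccard(set_a, set_b)
print(set_a, set_b, overlap)
```

With these toy numbers "for" stays near baseline in both documents and drops out, while "engine" is inflated in both and drives the overlap.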

Wouldn't a simple comparison of the word frequency in my text against a list of usual word frequencies do the trick here without an LLM? Sort of a BM25?

  • It might; it's not going to do the same thing. The LLM will tell you words that would likely appear in a similar text. Word frequency will tell you words that have actually appeared in your text. I'm postulating that the first kind of list is much more likely to show strong overlap between two similar documents than the second kind of list.

    Vocabulary style matters a lot to which words are actually used, but much less to which words are likely to be used. If I'm following a style guide that says to use "automobile" instead of "car", appearance probabilities for "automobile" will be greatly inflated. Appearance probabilities for "car" will also be greatly inflated, just to a lesser extent than for "automobile", whereas actual usage of "car" will be pegged at zero.

    Determining how similar two texts are is something that an LLM should be good at. It should be better than a simple comparison of word frequency. Whether it's better enough to justify the extra compute is a different question.
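To make the "automobile" vs. "car" point concrete, here is a toy comparison of the two kinds of list. The observed side is plain word counting; the predicted side uses hand-written stand-in probabilities, since no actual model is being run here. Every number, the tiny baseline table, and the helper names are assumptions for illustration, not measurements.

```python
import re
from collections import Counter


def observed_rates(text):
    """Relative frequency of each word that actually appears in text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()} if total else {}


def inflated(rates, baseline, min_ratio=10.0):
    """Words at least min_ratio times more likely than baseline.

    Words absent from the (tiny, made-up) baseline table are skipped.
    """
    return {w for w, p in rates.items()
            if w in baseline and p / baseline[w] >= min_ratio}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Invented baseline probabilities for a handful of words.
baseline = {"the": 0.05, "car": 0.0003, "automobile": 0.00002, "engine": 0.00005}

styled = "The automobile stalled. The automobile needs a new engine."
plain = "The car stalled. The car needs a new engine."

# Second kind of list: words that actually appeared.
obs_overlap = jaccard(inflated(observed_rates(styled), baseline),
                      inflated(observed_rates(plain), baseline))

# First kind of list: hypothetical model probabilities. Both "car" and
# "automobile" are inflated in both documents, to different extents.
styled_pred = {"the": 0.05, "automobile": 0.02, "car": 0.006, "engine": 0.01}
plain_pred = {"the": 0.05, "automobile": 0.002, "car": 0.02, "engine": 0.01}
pred_overlap = jaccard(inflated(styled_pred, baseline),
                       inflated(plain_pred, baseline))

print(obs_overlap, pred_overlap)
```

In this contrived setup the observed lists only share "engine", because "car" never appears in the style-guided text, while the predicted lists match completely. Whether a real model behaves this cleanly is exactly the open question.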