Comment by Xmd5a
21 hours ago
Indeed, but consider this situation: You have a collection of documents and want to extract the first n words because you're interested in the semantic content of the beginning of each doc. You use a LLM because why not. The LLM processes the documents, and every now and then it returns a slightly longer or shorter list of words because it better captures the semantic content. I'd argue the LLM is in fact doing exactly the right thing.
Let me hammer that nail deeper: your boss asks you to extract the first words of each document because he needs this info to run a marketing campaign. If you get back to him with a Google Sheets document where the cells read like "We the" or "It is", he'll probably exclaim, "This wasn't what I was asking for; obviously I need the first few words with actual semantic content, not glue words." And you may rail against your boss internally.
Now imagine you're consulting with a client prior to developing a digital platform to run marketing campaigns. If you take his words literally, he will certainly be disappointed by the result, and arguing about the strict formal definition of "2 words" won't make him deviate from what he actually wants.
LLMs have to navigate pragmatics too, because we make abundant use of it.
Good explanation. That's most likely the reason for it.
At the same time, it's what I don't like about most modern search functions: they won't let you search for exact words or sentences. It doesn't work on Google, it didn't work the last time I played around with Elasticsearch, and the same is true in many other places.
Obviously, if you want performance you need to group common words and ignore punctuation. But if you're doing code search for actual strings (like on GitHub), it's a totally different problem.
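To illustrate why exact-phrase search fails on such indexes, here is a minimal sketch of a typical analysis pipeline (lowercasing, punctuation stripping, stopword removal). The stopword list and document are made up for the example; real engines use larger lists, but the effect is the same: once "we" and "the" are dropped at index time, the exact phrase "We the" is simply not in the index anymore.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"the", "is", "a", "of", "we", "it"}

def analyze(text):
    """Typical index-time analysis: lowercase, strip punctuation, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

doc = 'He said: "We the People" -- exactly.'
print(analyze(doc))  # ['he', 'said', 'people', 'exactly']
```

Note that the quoted phrase, the capitalization, and the punctuation are all unrecoverable from the token list; that information is discarded before anything is written to the index.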
It would be nice to have a Google-like search index that you can query with regexps.
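Such indexes do exist in principle: the usual trick is a trigram index, where you use literal substrings implied by the regex to narrow the candidate set, then run the real regex only on the survivors. Here's a toy sketch; in real systems the required trigrams are derived from the regex automatically, whereas here a literal hint is passed in by hand as a simplification.

```python
import re

def trigrams(s):
    """All 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    """Toy trigram index: maps each trigram to the set of docs containing it."""

    def __init__(self, docs):
        self.docs = docs
        self.index = {}
        for doc_id, text in enumerate(docs):
            for tri in trigrams(text):
                self.index.setdefault(tri, set()).add(doc_id)

    def search(self, pattern, literal_hint):
        """Narrow candidates using a literal substring of the pattern,
        then confirm each candidate with the full regex."""
        candidates = set(range(len(self.docs)))
        for tri in trigrams(literal_hint):
            candidates &= self.index.get(tri, set())
        rx = re.compile(pattern)
        return sorted(d for d in candidates if rx.search(self.docs[d]))

docs = ["fmt.Println(err)", 'printf("hello")', "log.Fatal(err)"]
idx = TrigramIndex(docs)
print(idx.search(r"Print\w+\(err\)", "Print"))  # [0]
```

The point of the index is that the expensive regex only ever runs on documents that share the pattern's literal trigrams, so the cost scales with matches rather than with corpus size.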