Comment by nomilk
4 days ago
Occasionally I'll forget a famous quote [0], so I'll describe it to an LLM, but the LLM is rarely able to find it. I think it's because my description uses words that are 'like' the words in the quote, but not the exact ones, so the LLM gets confused and can't find it.
Interestingly, TFA draws the opposite conclusion (it says LLMs are quite good at identifying 'like' words, or at least better than the cosine method, which admittedly isn't a high bar).
[0] Admittedly, some are a little obscure, but they're in famous publications by famous authors, so I'd have expected an LLM to have 'seen' them before.
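For anyone who hasn't read the article: the "cosine method" it compares against just means turning the query and each candidate quote into vectors and ranking by cosine similarity. A rough, runnable sketch of that idea (toy bag-of-words counts standing in for real embedding vectors, so the details here are my own assumption, not the article's code):

    import numpy as np
    from collections import Counter

    def bow_vector(text, vocab):
        # Toy stand-in for an embedding: word counts over a shared vocabulary.
        counts = Counter(text.lower().split())
        return np.array([counts[w] for w in vocab], dtype=float)

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 0.0 if denom == 0 else float(a @ b / denom)

    query = "saying about repeating the same thing and hoping for a different outcome"
    quotes = [
        "insanity is doing the same thing over and over and expecting different results",
        "the only thing we have to fear is fear itself",
    ]
    vocab = sorted(set(" ".join([query] + quotes).lower().split()))
    q_vec = bow_vector(query, vocab)
    # Rank candidate quotes by similarity to the description of the quote.
    ranked = sorted(quotes, key=lambda q: cosine(q_vec, bow_vector(q, vocab)), reverse=True)
    print(ranked[0])

You can see why it struggles with my case: the description and the quote share almost no exact words, so the cosine score stays low even when the meaning matches.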
That's not how LLM training and recall work at all, so I'm not surprised you're not getting good results this way. You would be much better off using a conventional search engine, or, if you want to use an LLM, using one with a search tool so it runs the search engine for you.
The problem you're encountering is not that the model can't determine whether a quote it knows is responsive to your prompt; it's a problem with recall in the model (which is not generally a task it's trained for). So it's not a similarity problem, it's a recall problem.
When LLMs are trained on a particular document, they don't save a perfect copy that they can fish out later. They use it to update their weights via backpropagation, and they are evaluated on their "sentence completion" task during the main phase of training, or on a prompt-response eval set during instruction fine-tuning. Unless your quote is in that set, or is part of the eval for the sentence-completion task during the main training, there's no reason to expect the LLM to be able to recall it, because that's not what it's being trained to do.
So what happens instead is that training on your quote updates the weights in the model, and that, in some way that is quite mysterious, results in some ability to recall it later. But it's not a task the model is evaluated on or trained for, so it's not surprising that it's not great at it; in fact it's a wonder it can do it at all.
P.S. If you want to test whether it's actually struggling with similarity, look up a quote yourself and ask a model whether or not it's responsive to a given question, i.e. give it a prompt like this:
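(Something along these lines; the quote and question below are just placeholders.)

    Here is a quote: "Insanity is doing the same thing over and over and expecting different results."
    Question: what's that saying about repeating yourself and hoping for a better outcome?
    Is the quote responsive to the question? Answer yes or no, then explain briefly.

If the model handles that reliably, similarity isn't the bottleneck; recall is.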
Funny, I've done this a few times with good results. I guess it depends on the quote and the amount of context you give it.
This is also a case where something like Perplexity might yield better results, because it tries to find authoritative sources and then uses the LLM to evaluate what it finds, instead of relying on the LLM to have perfect recall of the quote. Which, of course, can fail in exactly the same way my own brain fails me (mangling words, mixing up people, etc.). It works surprisingly well. I pay for ChatGPT and I don't pay for Perplexity, but I find myself using the latter more and more.