Comment by mtlmtlmtlmtl
6 days ago
That's all well and good for this particular example. But in general, the verification can often be so much work it nullifies the advantage of the LLM in the first place.
Something I've been using perplexity for recently is summarizing the research literature on some fairly specific topic (e.g. the state of research on the use of polypharmacy in treatment of adult ADHD). Ideally it should look up a bunch of papers, look at them and provide a summary of the current consensus on the topic. At first, I thought it did this quite well. But I eventually noticed that in some cases it would miss key papers and therefore provide inaccurate conclusions. The only way for me to tell whether the output is legit is to do exactly what the LLM was supposed to do: search for a bunch of papers, read them and conclude on what the aggregate is telling me. And it's almost never obvious from the output whether the LLM did this properly or not.
The only way in which this is useful, then, is to find a random, non-exhaustive set of papers for me to look at (since the LLM also can't be trusted to accurately summarize them). Well, I can already do that with a simple search in one of the many databases built for this purpose, such as pubmed, arxiv etc. Any capability beyond that is merely an illusion. It's close, but no cigar. And in this case close doesn't really help reduce the amount of work.
This is why a lot of the things people want to use LLMs for require a "definiteness" that's completely at odds with the architecture. The fact that LLMs are good at pretending to do it well only serves to distract us from addressing the fundamental architectural issues that need to be solved. I don't think any amount of training of a transformer architecture is gonna do it. We're several years into trying that and the problem hasn't gone away.
> The only way for me to tell whether the output is legit is to do exactly what the LLM was supposed to do; search for a bunch of papers, read them and conclude on what the aggregate is telling me. And it's almost never obvious from the output whether the LLM did this properly or not.
You're describing a fundamental and inescapable problem that applies to literally all delegated work.
Sure, if you wanna be reductive, absolutist and cynical about it. What you're conveniently leaving out though is that there are varying degrees of trust you can place in the result depending on who did it. And in many cases with people, the odds they screwed it up are so low they're not worth considering. I'm arguing LLMs are fundamentally and architecturally incapable of reaching that level of trust, which was probably obvious to anyone interpreting my comment in good faith.
I think what you're leaving out is that what you're applying to people also applies to LLMs. There are many people you can trust to do certain things but can't trust to do others. Learning those ropes requires working with those people repeatedly, across a variety of domains. And you can save yourself some time by generalizing people into groups, and picking the highest-level group you can in any situation, e.g. "I can typically trust MIT grads on X", "I can typically trust most Americans on Y", "I can typically trust all humans on Z."
The same is true of LLMs, but you just haven't had a lifetime of repeatedly working with LLMs to be able to internalize what you can and can't trust them with.
Personally, I've learned more than enough about LLMs and their limitations that I wouldn't try to use them to do something like make an exhaustive list of papers on a subject, or a list of all toothpastes without a specific ingredient, etc. At least not in their raw state.
The first thought that comes to mind is that a custom LLM-based research agent equipped with tools for both web search and web crawl would be good for this, or (at minimum) one of the generic Deep Research agents that's been built. Of course the average person isn't going to think this way, but I've built multiple deep research agents myself, and have a much higher understanding of the LLMs' strengths and limitations than the average person.
So I disagree with your opening statement: "That's all well and good for this particular example. But in general, the verification can often be so much work it nullifies the advantage of the LLM in the first place."
I don't think this is a "general problem" of LLMs, at least not for anyone who has a solid understanding of what they're good at. Rather, it's a problem that comes down to understanding the tools well, which is no different than understanding the people we work with well.
P.S. If you want to make a bunch of snide assumptions and insults about my character and me not operating in good faith, be my guest. But in return I ask you to consider whether or not doing so adds anything productive to an otherwise interesting conversation.
Yup, and worse, since the LLM gives such a confident-sounding answer, most people will just skim over the "hmm, but maybe it's just lying" verification check and move forward, oblivious to the BS.
People did this before LLMs anyway. Humans are selfish, apathetic creatures and unless something pertains to someone's subject of interest the human response is "huh, neat. I didn't know dogs could cook pancakes like that" then scroll to the next tiktok.
This is also how people vote, apathetically and tribally. It's no wonder the world has so many fucking problems, we're all monkeys in suits.
I think that's my point. It enables exactly the worst behavior in the worst way, knowledge-wise.
Sure, but there are degrees in the real world. Do people sometimes spew bullshit (hallucinate) at you? Absolutely. But LLMs, that's all they do. They make bullshit and spew it. That's their default state. They're occasionally useful despite this behavior, but that doesn't mean they're not still bullshitting you.
I'd be very interested in hearing what conclusions you came to in your research, if you're willing to share.