Comment by jorvi

19 days ago

Current LLMs often produce much, much worse results than manually searching.

If you need to search the internet on a topic that is full of unknown unknowns for you, they're a pretty decent way to get a lay of the land, but beyond that, off to Kagi (or Google) you go.

Even worse is that the results are inconsistent. I can ask Gemini five times at what temperature I should take a waterfowl out of the oven, and get five different answers, 10°C apart.

You cannot trust answers from an LLM.

> I can ask Gemini five times at what temperature I should take a waterfowl out of the oven, and get five different answers, 10°C apart.

Are you sure? Both Gemini and ChatGPT gave me consistent answers three times in a row, even if the two models' answers differ slightly from each other.

Their answers are in line with this version:

https://blog.thermoworks.com/duck_roast/

  • What do you mean, "are you sure"? I literally saw, and still see, it happen in front of my eyes. Just now tested it with slight variations of "ideal temperature waterfowl cooking", "best temperature waterfowl roasting", etc. and all these questions yield different answers, with temperatures ranging from 47°C to 57°C (ignoring the 74°C food-safety ones).

    That's my entire point. Even adding an "is" or "the" can get you way different advice. No human would give you different info when you ask "what's the waterfowl's best cooking temperature" vs "what is waterfowl's best roasting temperature".

    • Did you point that out to one of them… like “hey bro, I’ve asked y’all this question in multiple threads and get wildly different answers. Why?”

      And the answer is probably because there is no such thing as an ideal temperature for waterfowl because the answer is “it depends” and you didn’t give it enough context to better answer your question.

      Context is everything. Give it poor prompts, you’ll get poor answers. LLMs are no different than programming a computer or anything else in this domain.

      And learning how to give good context is a skill. One we all need to learn.

I created an account just to point out that this is simply not true. I just tried it! The answers were consistent across all 5 samples with both "Fast" mode and Pro (which I think is really important to mention if you're going to post comments like this; I was thinking maybe it would be inconsistent with the Flash model).

  • Unfortunately, despite your account creation it remains true that this happened. Just tested it again and got different answers.

It obviously takes discipline, but using something like Perplexity as an aggregator typically gets me better results, because I can click through to the sources.

It's not a perfect solution because you need the discipline/intuition to do that, and not blindly trust the summary.

Did you actually ask the model this question or are you fully strawmanning?

  • My mother did, for Christmas. It was a goose that ended up being raw in a lot of places.

    I then pointed out this same inconsistency to her, and that she shouldn't put stock in what Gemini says. Testing it myself, it would give results between 47°C and 57°C. And sometimes it would just trip out and give the health-approved temperature, which is 74°C (!).

    Edit: just tested it again and it still happens. But inconsistency isn't a surprise for anyone who actually knows how LLMs work.

    • https://imgur.com/a/qYmznHa

      I just asked Gemini 3 the same question five times: `what temperature I should take a waterfowl out of the oven`

      and received generic advice every single time; it gave nearly identical charts, and 165°F was in every response. LLMs are unpredictable, yes. But I am more skeptical that it gave incorrect answers (raw goose) than that your mother prepared the fowl wrong.

      Cooking correctly is a skill, just as prompting is. Ask 10 people how to cook fowl and their answers will mimic the LLM.

    • > But inconsistency isn't a surprise for anyone who actually knows how LLMs work

      Exactly. These people saying they've gotten good results for the same question aren't countering your argument. All they're doing is proving that sometimes it can output good results. But a tool that's randomly right or wrong is not a very useful one. You can't trust any of its output unless you can validate it. And for a lot of the questions people ask of it, if you have to validate it, there was no reason to use the LLM in the first place.
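      For anyone curious why "sometimes right, sometimes wrong" is the expected behavior: chat LLMs are usually run with temperature sampling, so each token is drawn from a probability distribution rather than chosen deterministically. A minimal sketch of that mechanism, with made-up toy logits and candidate answers (not Gemini's actual values):

      ```python
      import math
      import random

      def sample_with_temperature(logits, temperature, rng):
          """Sample an index from logits after temperature-scaled softmax."""
          scaled = [l / temperature for l in logits]
          m = max(scaled)  # subtract max for numerical stability
          exps = [math.exp(s - m) for s in scaled]
          total = sum(exps)
          probs = [e / total for e in exps]
          # Draw from the resulting categorical distribution.
          r = rng.random()
          cum = 0.0
          for i, p in enumerate(probs):
              cum += p
              if r < cum:
                  return i
          return len(probs) - 1

      # Hypothetical "answers" a model might weigh for the same question.
      answers = ["54°C", "57°C", "47°C", "74°C"]
      logits = [2.0, 1.5, 1.2, 0.5]  # close enough that sampling varies

      rng = random.Random(0)
      picks = [answers[sample_with_temperature(logits, 1.0, rng)]
               for _ in range(5)]
      print(picks)  # five draws; with sampling on, runs need not agree
      ```

      At temperature near 0 the softmax concentrates on the highest logit and the output becomes effectively deterministic; at temperature 1 the nearby options all get real probability mass, which is why rephrasing a prompt (shifting the logits slightly) or simply re-asking can surface a different answer.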