
Comment by smusamashah

2 days ago

On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?

The code is at least testable and verifiable. For everything else I am left wondering if it's the truth or a hallucination. It incurs more mental burden than the one I was trying to avoid by using an LLM in the first place.

Absolutely. With LLMs you almost always "need to verify" the results. LLMs (for me) shine at pointing me in the right direction, getting a "first draft", or for things like code where I can test it.

It is really the only safe way to use it IMHO.

Even with the most simple forms of automation, humans suffer from Automation Bias and Complacency, and one of the better ways to avoid those issues is to instill a fundamental mistrust of those systems.

IMHO it is important to look at other fields and the human factors studies to understand this.

As an example, ABS was originally sold as a technology that would help you 'stop faster', which it may do in some situations, and it is of course mandatory in the US. But they had to shift how they 'sell' it, to ensure that people didn't come to rely on it.

https://www.fmcsa.dot.gov/sites/fmcsa.dot.gov/files/docs/200...

    2.18 – Antilock Braking Systems (ABS)

    ABS is a computerized system that keeps your wheels from locking up during hard brake applications.
    ABS is an addition to your normal brakes. It does not decrease or increase your normal braking capability. ABS only activates when wheels are about to lock up.
    ABS does not necessarily shorten your stopping distance, but it does help you keep the vehicle under control during hard braking.

Transformers will always produce code that doesn't work; it doesn't matter whether that is due to what they call hallucinations, Rice's theorem, etc.

Maintaining that mistrust is the mark of someone who understands and can leverage the technology. It is just yet another context specific tradeoff analysis that we will need to assess.

I think forcing people into the quasi-TDD thinking model, where they focus on what needs to be done first instead of jumping into the implementation details, will probably be a positive thing for the industry, no matter where on the spectrum LLM coding assistants land.

That is one of the hardest things to teach when trying to introduce TDD: starting from something far closer to an ADT than from implementation-specific unit tests is very different, but very useful.

I am hopeful that the tacit experience this requires will help people get past the barriers that formal frameworks keep running into when trying to teach that one skill.

As LLMs' failure mode is Always Confident, Often Competent, and Inevitably Wrong, it is super critical to always remember that the third option is likely and that you are the expert.

Agree. My biggest pain point with LLM code review tools is that they sometimes add 40 comments for a PR changing 100 lines of code. Gets noisy and hard to decipher what really matters.

Along the lines of verifiability, my take is that running a comprehensive suite of tests in CI/CD is going to be table stakes soon given that LLMs are only going to be contributing more and more code.

> On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?

I'm working on an LLM chat app that is built around mistrust. The basic idea is that it is unlikely that a supermajority of quality LLMs will all get it wrong.

This isn't foolproof, but it does provide some level of confidence in the answer.
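
In code, the basic consensus check amounts to something like this (a rough Python sketch; the model names, the prompt wording, and the single OpenAI-compatible client are placeholders, not the app's actual setup):

    # Rough sketch of a multi-model consensus check. The model list and the
    # shared OpenAI-compatible client are stand-ins; real use would fan out
    # to each provider's own API.
    from collections import Counter
    from openai import OpenAI

    client = OpenAI()
    MODELS = ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet", "llama-3.1-70b"]  # placeholders

    def ask_all(question: str) -> dict[str, str]:
        """Ask every model the same yes/no question and collect the raw answers."""
        prompt = f"{question} Answer strictly 'yes' or 'no' first, then explain."
        answers = {}
        for model in MODELS:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            answers[model] = resp.choices[0].message.content
        return answers

    def consensus(answers: dict[str, str], threshold: float = 0.75) -> str | None:
        """Return 'yes' or 'no' if at least `threshold` of the models agree, else None."""
        votes = Counter(a.strip().lower().split()[0].strip(".,!") for a in answers.values())
        top, count = votes.most_common(1)[0]
        return top if count / len(answers) >= threshold else None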

Here is a quick example in which I analyze results from multiple LLMs that answered, "When did Homer Simpson go to Mars?"

https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...

If you look at the yes and no table, all except GPT-4o and GPT-4o mini said no. After asking GPT-4o who was correct, it provided "evidence" from an episode, so I asked for more information on that episode. Based on what it said, it looks like the mission to Mars was a hoax, and when I challenged GPT-4o on this, it agreed and said Homer never went to Mars, like the others had said.

I then asked Sonnet 3.5 about the episode and it said GPT-4o misinterpreted the plot.

https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...

At this point, I am confident (but not 100% sure) that Homer never went to Mars, and if I really needed to know, I'd have to search the web.

  • It's the backwards reasoning that really frustrates me when using LLMs. You ask a question, it says "sure, do these things", they don't work out, and when you ask the LLM why not, it replies "yes, that thing I told you to do wouldn't work, because of these clear reasons".

    It would be nice to start at the end of that chain of reasoning instead of the other side.

    Another regular example is when it "invents" functions or classes that don't exist; when pressed about them, it will reply that of course that won't work, that function doesn't exist.

    "Okay great, so don't tell me it does with such certainty" is what I would tell a human who kept feeding me imagination as fact. But of course an LLM is not reasoning in the same sense, so this reverse chain of thought is the outcome.

    I am finding LLMs far more useful for soft skill topics than engineering type work, simply because of how often it leads me down a path that is eventually a dead end, because of some small detail that was wrong at the very beginning.

    • > I am finding LLMs far more useful for soft skill topics than engineering type work, simply because of how often it leads me down a path that is eventually a dead end, because of some small detail that was wrong at the very beginning.

      Yeah, I felt the same way in the beginning, which is why I ended up writing my own chat app. What I've found while developing my spelling and grammar checker is that it is very unlikely for multiple LLMs to mess up at the same time. I know they will mess up, but I'm also pretty sure they won't all do so at the same time.

      So far, I've been able to successfully create working features that actually saved me time by pitting LLMs against their own responses and others. My process right now is, I'll ask 6+ models to implement something and then I will ask models to evaluate everyone's responses. More often than not, a model will find fault or make a suggestion that can be used to improve the prompt or code. And depending on my confidence level, I might repeat this a couple of times.

      The issue right now is tracking this "chain of questioning", which is why I am writing my own chat app. I need an easy way to backtrack and fork from different points in the "chain of questioning". I think once we get a better understanding of what LLMs can and can't do as a group, we should be able to produce working solutions more easily.
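
      Roughly, that ask-many-then-cross-review loop looks something like this (a minimal Python sketch; the model names and the single OpenAI-compatible client are just placeholders):

          # Minimal sketch of the "ask several models, then have them review each
          # other" loop described above. Model names and the shared client are
          # placeholders; real use would fan out to different providers.
          from openai import OpenAI

          client = OpenAI()
          MODELS = ["model-a", "model-b", "model-c"]  # stand-ins for the 6+ models

          def ask(model: str, prompt: str) -> str:
              resp = client.chat.completions.create(
                  model=model, messages=[{"role": "user", "content": prompt}])
              return resp.choices[0].message.content

          def implement_all(task: str) -> dict[str, str]:
              """Get an independent implementation from every model."""
              return {m: ask(m, f"Implement the following:\n{task}") for m in MODELS}

          def cross_review(solutions: dict[str, str]) -> dict[str, str]:
              """Have every model critique the full set of candidate solutions."""
              numbered = "\n\n".join(f"Solution {i}:\n{code}"
                                     for i, code in enumerate(solutions.values(), 1))
              prompt = ("Review the solutions below. Point out bugs, missed cases, "
                        "or ways the original prompt could be improved.\n\n" + numbered)
              return {m: ask(m, prompt) for m in MODELS}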

  • Isn't this essentially making the point of the post above you?

    For comparison - if I just do a web search for "Did homer simpson go to mars" I get immediately linked to the Wikipedia page for that exact episode (https://en.wikipedia.org/wiki/The_Marge-ian_Chronicles), and the plot summary is less to read than your LLM output. It clearly summarizes that Marge and Lisa (note - NOT Homer) almost went to Mars, but did not go. Further, the summary correctly includes the outro, which does show Marge and Lisa on Mars in the year 2051.

    Basically - for factual content, the LLM output was a garbage game of telephone.

    • > Isn't this essentially making the point of the post above you?

      Yes. This is why I wrote the chat app, because I mistrust LLMs, but I do find them extremely useful when you approach them with the right mindset. If answering "Did Homer Simpson go to Mars?" correctly is critical, then you can choose to require a 100% consensus, otherwise you will need a fallback plan.

      When I asked all the LLMs about the Wikipedia article, they all correctly answered "No" and talked about Marge and Lisa in the future without Homer.

  • Relatedly, when I ask LLMs what happens in a TV episode, or in a series in general, I usually get very low-quality and mostly flat-out wrong answers. That baffles me, as I thought there were multiple well-structured synopses for any TV series in the training data.

Yes, it is good for summarizing existing text, explaining something, or coding; in short, any generative/transformative task. It is not good for information retrieval. Having said that, even tiny Qwen 3B/7B coding LLMs have turned out to be very useful in my experience.

You're going to fall behind eventually if you continue to treat LLMs with this level of skepticism. Others won't, and the output is accurate enough that it can usefully improve the efficiency of work in a great many situations.

Rarely are day-to-day written documents (e.g. an email asking for clarification on an issue or to schedule an appointment) of such importance that the occasional error is unforgivable. In situations where a mistake is fatal, yes I would not trust GenAI. But how many of us really work in that kind of a field?

Besides, AI shines when used for creative purposes. Coming up with new ideas or rewording a paragraph for clarity isn't something one does blindly. GenAI is a coworker, not an authority. It'll generate a draft; I may edit that draft or rewrite it significantly, but to preclude it because it could err will eventually slow you down in your field.

  • You’re narrowly addressing LLM use cases & omitting the most problematic one - LLMs as search engine replacements.

    • That's the opposite of problematic; that's where an LLM shines. And before you say "hallucination": when was the last time you didn't click the link in a Google search result? It's user error if you don't follow up with additional validation, exactly as you would with Google. With GenAI it's simply easier to craft specific queries.

We need a hallucination benchmark.

My experience is, o1 is very good at avoiding hallucinations and I trust it more, but o1-mini and 4o are awful.
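
A bare-bones version of such a benchmark could just be a fixed set of questions with known ground truth, scored by whether the model's answer contains an accepted phrase (everything below is illustrative; ask() stands in for whatever client you use):

    # Toy sketch of a hallucination benchmark. The cases, accepted phrases,
    # and the ask(model, question) helper are all placeholders.
    from typing import Callable

    CASES = [
        # (question, phrases that count as a non-hallucinated answer)
        ("Did Homer Simpson ever go to Mars?", ["never", "did not", "no,"]),
        ("What is the capital of Australia?", ["canberra"]),
    ]

    def hallucination_rate(model: str, ask: Callable[[str, str], str]) -> float:
        """Fraction of cases where none of the accepted phrases appear.
        Crude substring matching; a real benchmark would grade more carefully."""
        wrong = 0
        for question, accepted in CASES:
            answer = ask(model, question).lower()
            if not any(phrase in answer for phrase in accepted):
                wrong += 1
        return wrong / len(CASES)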

  • Well, given the price ($15.00 / 1M input tokens and $60.00 / 1M output tokens), I would hope so. At that price, I think it is fair to say it is doing a lot of checks in the background.

    • It is expensive. But if I'm correct about o1, it means user mistrust of LLMs is going to be a short-lived thing as costs come down and more people use o1 (or better) models as their daily driver.
