
Comment by bambax

7 hours ago

With all due respect and while wishing you best of luck, it's always a bit worrisome when generative AI is used in the real world with real consequences...

In my experience, what LLMs, even some of the most advanced ones (o1, Gemini 1.5), are really good at is rationalization after the fact: explaining why they were right, even when presented with direct evidence to the contrary.

I just ran an experiment trying to get various models to put footnote references in the OCR of a text, based on the content of the footnotes. I tested 120+ different models via OpenRouter; they all failed, but the "best" ones failed in a very bizarre and, I think, dangerous way: they made up some text to better fit the footnote references! And then they lied about it, saying in a "summary" paragraph that no text had been changed, and/or that they had indeed been able to place all references.
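One cheap way to catch that failure mode automatically: strip the inserted markers back out and check that what remains is character-for-character the OCR input. A minimal sketch, assuming markers look like `[^1]` (the marker format and the example strings are illustrative assumptions, not from the experiment):

```python
import re

def only_markers_added(original: str, annotated: str) -> bool:
    """True iff `annotated` is `original` with footnote markers
    (assumed to look like [^1], [^2], ...) inserted and nothing
    else changed; whitespace is normalized before comparing."""
    stripped = re.sub(r"\[\^\d+\]", "", annotated)
    norm = lambda s: " ".join(s.split())
    return norm(stripped) == norm(original)

ocr  = "The treaty was signed in 1648. It ended the war."
good = "The treaty was signed in 1648.[^1] It ended the war.[^2]"
bad  = "The treaty was signed in 1648,[^1] as scholars note. It ended the war.[^2]"

print(only_markers_added(ocr, good))  # True: only markers were inserted
print(only_markers_added(ocr, bad))   # False: the model rewrote the text
```

A check like this catches the "made up some text" failure, though not a model that silently skips placing some references.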

So I guess my question is: how do you detect and flag hallucinations?

This is a really good point, but we don't think hallucinations pose a significant risk to us. You can think of Fresco like a really good scribe; we're not generating new information, just consolidating the information that the superintendent has already verbally flagged as important.

  • This is the wrong response. It doesn't matter whether you've asked it to summarize or to produce new information; hallucinations are always a question of when, not if. LLMs don't have a "summarize mode": their mode of operation is always the same.

    A better response would have been "we run all responses through a second agent that validates that no content was added that wasn't in the original source". To say that you simply don't believe hallucinations apply to you tells me that you haven't spent enough time with this technology to be selling it to safety-critical industries.

  • This seems odd. If your scribe can lie in complex and sometimes hard-to-detect ways, how do you not see some form of risk? What happens when (not if) your scribe misses something and real-world damages ensue as a result? Are you expecting your users to cross-check every report? And if so, what's the benefit of your product?

    • We rely on multimodal input: the voiceover from the superintendent as well as the video. The two essentially cross-check one another, so we think the likelihood of lies or hallucinations is incredibly low.

      Superintendents usually still check and, if needed, edit/enrich Fresco’s notes. Editing is way faster/easier than generating notes net new, so even in the extreme scenario where a supe needs to edit every single note, they’re still saving ~90% of the time it’d otherwise have taken to generate those notes and compile them into the right format.

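The "second agent that validates no content was added" idea suggested above can be approximated even without a second model: flag any note sentence whose content words mostly don't appear in the source transcript. A toy, deliberately crude sketch (the tokenization, the 0.7 threshold, and the sample transcript are all arbitrary assumptions):

```python
def tokens(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,;:!?").lower() for w in text.split()}

def ungrounded_sentences(source: str, notes: str, threshold: float = 0.7) -> list[str]:
    """Flag note sentences whose words mostly don't appear in the
    source transcript -- a crude proxy for 'the model added something'."""
    src = tokens(source)
    flagged = []
    for sent in filter(None, (s.strip() for s in notes.split("."))):
        toks = tokens(sent)
        overlap = len(toks & src) / len(toks)
        if overlap < threshold:
            flagged.append(sent)
    return flagged

transcript = "Rebar on level three is missing ties. Pour is delayed until Friday."
notes = "Rebar on level three is missing ties. Pour delayed until Friday. Crane inspection passed."
print(ungrounded_sentences(transcript, notes))  # ['Crane inspection passed']
```

Word overlap misses paraphrased fabrications, which is why a real pipeline would use a second model for the grounding check; but even this naive filter catches whole-cloth additions like the one above.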

It has to be the same as all AI: you need someone thorough to check what it did.

LLM generated code needs to be read line by line. It is still useful to do that with code because reading is faster than googling then typing.

You can't detect hallucinations in general.

  • A (costly) way is to compare responses from different models, since they don't hallucinate in exactly the same way.
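That cross-model comparison can be mechanized: run the same prompt through several models and single out the response that agrees least with the rest. A minimal sketch with hard-coded stand-in responses (real code would call each model's API; the similarity measure here is plain string matching, an assumption):

```python
from difflib import SequenceMatcher

def odd_one_out(responses: list[str]) -> int:
    """Index of the response that agrees least with the others,
    by mean pairwise string similarity."""
    def mean_sim(i: int) -> float:
        sims = [SequenceMatcher(None, responses[i], r).ratio()
                for j, r in enumerate(responses) if j != i]
        return sum(sims) / len(sims)
    return min(range(len(responses)), key=mean_sim)

# Stand-ins for outputs from three different models on the same prompt.
answers = [
    "The invoice total is $4,200, due March 3.",
    "The invoice total is $4,200, due March 3.",
    "The invoice total is $7,500, due March 30.",
]

print(odd_one_out(answers))  # → 2 (the divergent response)
```

Low agreement across models doesn't prove a hallucination, and unanimous agreement doesn't rule one out; it's a triage signal for deciding which outputs a human should check first.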

The process you described is very far from how companies that productize LLMs use them.

Honestly, this is a very nitpicky argument. The issue for site contractors is not manually checking each entry for correctness. It's writing the stuff down in the first place.

I'm exploring a similar but unrelated use case for generative AI, and in discovery interviews, what I learnt was that site contractors and engineers do not request or expect 100% accuracy, and leave adequate room for doubt. For them, the pain is the hours and hours of manually writing a TON of paperwork, which in some industries amounts to months and months of work written by some of the poorest communicators on the planet. Because these tasks consume so much time, they forgo the correct methodology, and some even fill reports with random bullshit just so the project moves forward. In most cases this writing is done for liability reasons, as mentioned above, rather than for the purposes of someone actually going through it. If the writing part is cleared for many of these guys, most wouldn't have a problem with the reading and correcting part.

  • It's unclear how filling reports with "random bullshit" will protect anyone from liability... It seems you're saying that the current situation is so bad that anything different would be an improvement, and less-random bs is better than outright bs.

    I'm sorry if my comment came across as nitpicky; it's just that every time I try to do some actual work with LLMs (that's not pure creativity, where hallucination is a feature) it never follows prompts exactly and quickly goes off the rails. In the context of construction work, that sounded dangerous. But happy to be proved wrong.

    • > It's unclear how filling reports with "random bullshit" will protect anyone from liability... It seems you're saying that the current situation is so bad that anything different would be an improvement, and less-random bs is better than outright bs.

      Exactly. Oftentimes reports are filled with nonsensical documentation that is only discovered during discovery in litigation, after a disaster has already happened. For example, a real safety report at a chemicals facility stated that under high valve pressure "many bad things will happen". Not joking, literally quoted verbatim.

      Most companies' legal teams would love to have their engineers write proper documents, and most engineers would love to not spend time on documentation. GenAI can fill that gap by at least providing a baseline starting point that can be edited in a fraction of the time it would take to write from scratch.