
Comment by llm_trw

15 days ago

This is great until it hallucinates rows in a report with company assets that don't exist - why wouldn't a mining company own some excavation equipment? - and pollutes all future searches with fake data right from the start.

I laugh every time I hear someone tell me how great VLMs are for serious work by themselves. They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

It does seem that companies are able to get reliability in narrow problem domains via prompts, evals, and fine-tuning.

  • > It does seem

    And therein lies the whole problem. The verification required for serious work likely costs orders of magnitude more than anybody is willing to spend.

    For example, professional OCR companies have large teams of reviewers who double- or triple-review everything, and that is after the software itself flags recognition results with varying degrees of certainty. I don't think companies are thinking of LLMs as tools that require that level of dedication and resources, in virtually all larger-scale use cases (a rough sketch of that kind of confidence-gated review appears below).

    • This seems to be exactly the business model of myriad recent YC startups. It seemingly did work for casetext, for example.

  • In some cases this is true, but then why choose an expensive world model over a small net or random forest you trained specifically for the task at hand?
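A rough sketch of the confidence-gated review pipeline described a few comments up; the thresholds, field names, and double-review rule are assumptions made for illustration, not any particular vendor's process.

```python
from dataclasses import dataclass

# Illustrative thresholds -- real OCR shops tune these per document type.
AUTO_ACCEPT = 0.98    # at or above this, no human review
SINGLE_REVIEW = 0.90  # between the thresholds, one reviewer checks the field;
                      # below SINGLE_REVIEW, two independent reviewers must agree

@dataclass
class OcrField:
    text: str
    confidence: float  # per-field confidence reported by the OCR engine

def route_for_review(field: OcrField) -> str:
    """Decide how much human verification a recognized field needs."""
    if field.confidence >= AUTO_ACCEPT:
        return "auto_accept"
    if field.confidence >= SINGLE_REVIEW:
        return "single_review"
    return "double_review"

if __name__ == "__main__":
    fields = [
        OcrField("ACME Mining Ltd", 0.995),
        OcrField("Excavator, CAT 390F", 0.93),
        OcrField("$1,2O0,000", 0.71),  # likely misread: letter O instead of zero
    ]
    for f in fields:
        print(f"{f.text!r}: {route_for_review(f)}")
```

The point of the sketch is that the flagging and the reviewers are parts of one pipeline: the model's confidence only decides how much human attention each field gets, it never replaces that attention.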

We completely agree - mechanistic interpretability might help keep these language models in check, but it’s going to be very difficult to run this on closed-source frontier models. I'm excited to see where that field progresses.

> They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

So are human beings. Meaning we've been working around this issue since forever; we're not suddenly caught up in a new thing here.

  • While humans can and do make mistakes, it seems to me like there is a larger problem here that LLMs make mistakes for different reasons than humans and that those reasons make them much worse than humans at certain types of problems (e.g. OCR). Worse, this weakness might be fundamental to LLM design rather than something that can be fixed by just LLM-ing harder.

    I think a lot of this gets lost in the discussion because people insist on using terminology that anthropomorphizes LLMs to make their mistakes sound human. So LLMs are "hallucinating" rather than having faulty output because their lossy, probabilistic model fundamentally doesn't actually "understand" what's being asked of it the way a human would.

    • This is what a lot of people miss. We have thousands of years of understanding the kinds of mistakes that humans make; we only have months to years of experience with the mistakes that LLMs and other AIs make.

      This means that most of our verification and testing processes won't inherently catch AI errors because they're designed to catch human errors. Things like "check to see if the two sides of these transactions sum to 0" are fine for human typos, but they won't catch a fake (yet accurately entered) transaction (a short sketch of this appears at the end of the thread).

      It's similar to a language barrier. You don't realize how much you rely on context clues until you spend 3 days of emails trying to communicate a complex topic to someone in their second language.


  • Human beings also have the ability to doubt their own abilities and understanding. In the case of transcription, if someone has doubts about what they are transcribing, they'll try to seek clarity instead of just making something up. They also have the capacity to know when something is too important to screw up and to adjust their behaviour accordingly.
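To make the balance-check point from earlier in the thread concrete, here is a minimal sketch; the transaction format, account names, and amounts are invented for illustration.

```python
# A double-entry balance check: the kind of control designed to catch
# human slips (typos, transposed digits), not fabricated-but-consistent data.

def balances(entries):
    """Return True if the legs of a transaction sum to zero (within rounding)."""
    return abs(sum(amount for _, amount in entries)) < 0.005

# Human-style error: a typo in one leg breaks the balance and gets flagged.
typo_txn = [("equipment", 125_000.00), ("cash", -125_500.00)]

# AI-style error: an entirely invented purchase, entered perfectly consistently.
fabricated_txn = [("excavator (does not exist)", 480_000.00), ("cash", -480_000.00)]

print(balances(typo_txn))        # False -> caught by the control
print(balances(fabricated_txn))  # True  -> passes straight through
```

The control is working exactly as designed; it was simply never designed for this failure mode.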