The "confident idiot" problem: Why AI needs hard rules, not vibe checks

6 days ago (steerlabs.substack.com)

The thing that bothers me the most about LLMs is how they never seem to understand "the flow" of an actual conversation between humans. When I ask a person something, I expect them to give me a short reply which includes another question/asks for details/clarification. A conversation is thus an ongoing "dance" where the questioner and answerer gradually arrive at the same shared meaning.

LLMs don't do this. Instead, every question is immediately responded to with extreme confidence with a paragraph or more of text. I know you can minimize this by configuring the settings on your account, but to me it just highlights how it's not operating in a way remotely similar to the human-human one I mentioned above. I constantly find myself saying, "No, I meant [concept] in this way, not that way," and then getting annoyed at the robot because it's masquerading as a human.

  • LLMs all behave as if they are semi-competent (yet eager, ambitious, and career-minded) interns or administrative assistants, working for a powerful CEO-founder. All sycophancy, confidence and positive energy. "You're absolutely right!" "Here's the answer you are looking for!" "Let me do that for you immediately!" "Here is everything I know about what you just mentioned." Never admitting a mistake unless you directly point it out, and then all sorry-this and apologize-that and "here's the actual answer!" It's exactly the kind of personality you always see bubbling up into the orbit of a rich and powerful tech CEO.

    No surprise that these products are all dreamt up by powerful tech CEOs who are used to all of their human interactions being with servile people-pleasers. I bet each and every one of them is subtly or overtly shaped by feedback from executives about how it should respond in conversation.

    • I agree entirely, and I think it's worthwhile to note that it may not even be the LLM that has that behavior. It's the entire deterministic machinery between the user and the LLM that creates that behavior, with the system prompt, personality prompt, RLHF, temperature, and the interface as a whole.

      LLMs have an entire wrapper around them tuned to be as engaging as possible. Most people's experience of LLMs is shaped by a design heavily influenced by social media and the engagement economy.

    • > "You're absolutely right!" "Here's the answer you are looking for!" "Let me do that for you immediately!" "Here is everything I know about what you just mentioned." Never admitting a mistake unless you directly point it out, and then all sorry-this and apologize-that and "here's the actual answer!" It's exactly the kind of personality you always see bubbling up into the orbit of a rich and powerful tech CEO.

      You may be on to something there: the guys and gals that build this stuff may very well be imbuing these products with the kind of attitude that they like to see in their subordinates. They're cosplaying the 'eager to please' element to the point of massive irritation, and they've left out the one feature that could redeem such behavior: competence.

      7 replies →

    • Analogies of LLMs to humans obfuscate the problem. LLMs aren't like humans of any sort in any context. They're chat bots. They do not "think" like humans, and applying human-like logic to them does not work.

      17 replies →

    • I don't think these LLMs were explicitly designed based on the CEO's detailed input that boils down to 'reproduce these servile yes-men in LLM form please'.

      Which makes it more interesting. Apparently reddit was a particularly hefty source for most LLMs; your average reddit conversation is absolutely nothing like this.

      Separate observation: That kind of semi-slimy obsequious behaviour annoys me. Significantly so. It raises my hackles; I get the feeling I'm being sold something on the sly. Even if I know the content in between all the sycophancy is objectively decent, my instant emotional response is negative and I have to use my rational self to dismiss that reaction.

      But I notice plenty of people around me that respond positively to it. Some will even flat out ignore any advice if it is not couched in multiple layers of obsequious deference.

      Thus, that raises a question for me: Is it innate? Are all people placed on a presumably bell-curve shaped chart of 'emotional response to such things', with the bell curve quite smeared out?

      Because if so, that would explain why some folks have turned into absolute zealots for the AI thing, on both sides of it. If you respond negatively to it, any serious attempt to play with it should leave you feeling like it sucks to high heavens. And if you respond positively to it - the reverse.

      Idle musings.

      8 replies →

    • The problem with these LLM chat-bots is they are too human, like a mirror held up to the plastic-fantastic society we have morphed into. Naturally programmed to serve as a slave to authority, this type of fake conversation is what we've come to expect as standard. Big smiles everyone! Big smiles!!

      3 replies →

    • There is sort of the opposite problem as well, as the top comment was saying, where it can super confidently insist that it's absolutely right and you're wrong instead of asking questions to try and understand what you mean.

    • > LLMs all behave as if they are semi-competent

      Only in the same way that all humans behave the same.

      You can prompt an LLM to talk to you however you want it to; it doesn't have to be nice to you.

    • Looking forward to living in a society where everyone feels like they’re CEOs.

    • Isn’t it kind of true that the systems we, as servile people-pleasers, have to operate within are exactly these? The hierarchical status games and alpha-animal tribal dynamics are these. Our leaders, who are so mighty and rich and powerful, want to keep their position, and we don’t want to admit they have more influence than we do over things like AI now, so we stand and watch naively as they reward the people-pleasers; historically we learn(ed) it pays to please until leadership changes.

    • > LLMs all

      Sounds like you don't know how RLHF works. Everything you describe is post-training. Base models can't even chat; they have to be trained to do even basic conversational turn-taking.

      2 replies →

    • This is partly true, partly false, partly false in the opposite direction, with various new models. You really need to keep updating and have tons of interactions regularly in order to speak intelligently on this topic.

      2 replies →

  • > LLMs don't do this. Instead, every question is immediately responded to with extreme confidence with a paragraph or more of text.

    Having just read a load of Quora answers like this, which did not cover the thing I was looking for, that is how humans on the internet behave and how people have to write books, blog posts, articles, documentation. Without the "dance" to choose a path through a topic on the fly, the author has to take the burden of providing all relevant context, choosing a path, explaining why, and guessing at any objections and questions and including those as well.

    It's why "this could have been an email" is a bad shout. The summary could have been an email, but the bit which decided on that being the summary would be pages of guessing at all the things that might have come up in the call and which ones to include or exclude.

    • This is a recent phenomenon. It seems most of the pages today are SEO optimized LLM garbage with the aim of having you scroll past three pages of ads.

      The internet really used to be efficient, and I could always find exactly what I wanted with an imprecise Google search ~15 years ago.

      10 replies →

    • Interesting. Like many people here, I've thought a great deal about what it means for LLMs to be trained on the whole available corpus of written text, but real world conversation is a kind of dark matter of language as far as LLMs are concerned, isn't it? I imagine there is plenty of transcription in training data, but the total amount of language use in real conversation surely far exceeds any available written output and is qualitatively different in character.

      This also makes me curious to what degree this phenomenon manifests when interacting with LLMs in languages other than English? Which languages have less tendency toward sycophantic confidence? More? Or does it exist at a layer abstracted from the particular language?

    • That's part of it, but I think another part is just the way the LLMs are tuned. They're capable of more conversational tones, but human feedback in post-training biases them toward a writing style that's more of a Quora / StackOverflow / Reddit Q&A style because that's what gets the best ratings during the RLHF process.

  • Yes you're totally right! I misunderstood what you meant, let me write six more paragraphs based on a similar misunderstanding rather than just trying to get clarification from you

    • My favorite is when it bounces back and forth between the same two wrong answers, each time admitting that the most recent answer is wrong and going back to the previous wrong answer.

      Doesn't matter if you tell it "that's not correct and neither is ____ so don't try that instead," it likes those two answers and it's going to keep using them.

      6 replies →

    • Once the context is polluted with wrong information, it is almost impossible to get it right again.

      The only reliable way to recover is to edit your previous question to include the clarification, and let it regenerate the answer.

  • ChatGPT offered a "robotic" personality which really improved my experience. My frustrations were basically decimated right away and I quickly switched to a more "You get out of it what you put in" mindset.

    And less than two weeks in they removed it and replaced it with some sort of "plain and clear" personality which is human-like. And my frustrations ramped up again.

    That brief experiment taught me two things: 1. I need to ensure that any robots/LLMs/mech-turks in my life act at least as cold and rational as Data from Star Trek. 2. I should be running my own LLM locally to not be at the whims of $MEGACORP.

    • > I should be running my own LLM

      I approve of this, but in your place I'd wait for hardware to become cheaper when the bubble blows over. I have an i9-10900, and bought an M.2 SSD and 64GB of RAM in July for it, and get useful results with Qwen3-30B-A3B (some 4-bit quant from unsloth running on llama.cpp).

      It's much slower than an online service (~5-10 t/s), and lower quality, but it still offers me value for my use cases (many small prototypes and tests).

      In the meantime, check out LLM service prices on https://artificialanalysis.ai/ Open source ones are cheap! Lower on the homepage there's a Cost Efficiency section with a Cost vs Intelligence chart.
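
      If you do go local, llama.cpp's llama-server exposes an OpenAI-compatible endpoint, so the Python side can stay tiny. A rough sketch (the model file name, port, and prompt are just examples):

          from openai import OpenAI

          # Started separately with something like:
          #   llama-server -m qwen3-30b-a3b-q4_k_m.gguf --port 8080
          client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

          resp = client.chat.completions.create(
              model="local",  # llama-server serves whichever single model it loaded
              messages=[{"role": "user", "content": "Outline the tradeoffs of local inference."}],
          )
          print(resp.choices[0].message.content)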

      1 reply →

    • Sort of a personal modified Butlerian Jihad? Robots / chatbots are fine as long as you KNOW they're not real humans and they don't pretend to be.

  • I never expected LLMs to be like an actual conversation between humans. The model is in some respects more capable and in some respects more limited than a human. I mean, one could strive for an exact replica of a human -- but for what purpose? The whole thing is a huge association machine. It is a surrealistic inspiration generator for me. This is how it works at the moment, until the next breakthrough ...

    • > but for what purpose?

      I recently introduced a non-technical person to Claude Code, and this non-human behavior was a big sticking point. They tried to talk to Claude the way they would to a human, presenting it one piece of information at a time. With humans this is generally beneficial, and they will either nod for you to continue or ask clarifying questions. With Claude this does not work well; you have to infodump as much as possible in each message.

      So even from a perspective of "how do we make this automaton into the best tool", a more human-like conversation flow might be beneficial. And that doesn't seem beyond the technological capabilities at all, it's just not what we encourage in today's RLHF

      12 replies →

    • Clarifying ambiguity in questions before dedicating more resources to search and reasoning about the answer seems both essential and almost trivial to elicit via RLHF.

      I'd be surprised if you can't already make current models behave like that with an appropriate system prompt.
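
      Something like this already seems to work (a rough sketch using the OpenAI Python SDK; the model name and the exact wording of the instruction are just placeholders):

          from openai import OpenAI

          client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

          CLARIFY_FIRST = (
              "Before answering, decide whether the request is ambiguous. "
              "If it is, ask exactly one clarifying question and stop; "
              "only answer once the ambiguity is resolved."
          )

          resp = client.chat.completions.create(
              model="gpt-4o-mini",  # placeholder model name
              messages=[
                  {"role": "system", "content": CLARIFY_FIRST},
                  {"role": "user", "content": "What's the weather in Springfield?"},
              ],
          )
          print(resp.choices[0].message.content)  # ideally: "Which Springfield do you mean?"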

    • The disconnect is that companies are trying desperately to frame LLMs as actual entities and not just an inert tech tool. AGI as a concept is the biggest example of this, and the constant push to "achieve AGI" is what's driving a lot of stock prices and investment.

      A strictly machinelike tool doesn't begin answers by saying "Great question!"

  • Training data is quite literally weighted this way - long responses on Reddit have lots of tokens, and brief responses don't get counted nearly as much.

    The same goes for "rules" - you train an LLM with trillions of tokens and try to regulate its behavior with thousands. For a person in high school, by contrast, grading and feedback make up a much higher percentage of the training.

    • Not to mention that Reddit users seek "confident idiots". Look at where the thoughtful questions that you'd expect to hear in a human social setting end up (hint: Downvoted until they disappear). Users on Reddit don't want to have to answer questions. They want to read the long responses that they can then nitpick. LLMs have no doubt picked up on that in the way they are trained.

  • > The thing that bothers me the most about LLMs is

    What bothers me the most is the seemingly unshakable tendency of many people to anthropomorphise this class of software tool as though it is in any way capable of being human.

    What is it going to take? Actual, significant loss of life in a medical (or worse, military) context?

    • That qualifier only makes the anthropomorphization more sound. Have you actually thought it through? Give an untrained and unspecialized human the power to cause significant loss of life in a medical context in the same exact capacity, and it's all but guaranteed that's the outcome you'll end up with.

      I think it's important to be skeptical and push back against a lot of the ridiculous mass-adoption of LLMs, but not if you can't actually make a well-reasoned point. I don't think you realize the damage you do when the people gunning for mass proliferation of LLMs in places they don't belong can only find examples of incoherent critique.

      1 reply →

  • A lot of this, I suspect (on the basis of having worked on a supervised fine-tuning project for one of the largest companies in this space), is that providers have invested a lot of money in fine-tuning datasets that sound this way.

    On the project I did work on, reviewers were not allowed to e.g. answer that they didn't know - they had to provide an answer to every prompt they were given. And so when auditing responses, a lot of difficult questions had "confidently wrong" answers because the reviewer tried and failed, or contained all kinds of evasive workarounds because the reviewer knew they couldn't answer.

    Presumably these providers will eventually understand (hopefully they already have - this was a year ago) that they also need to train the models to understand when the correct answer is "I don't know", or "I'm not sure. I think maybe X, but ..."

    • It's not the training/tuning, it's pretty much the nature of LLMs. The whole idea is to give a best guess of the next token. The more complex dynamics behind the meaning of the words and how those words relate to real-world concepts aren't learned.

      1 reply →

  • They are purposely trained to be this way.

    In a way it's benchmaxxing because people like subservient beings that help them and praise them. People want a friend, but they don't want any of that annoying friction that comes with having to deal with another person.

  • If you're paying per token then there is a big business incentive for the counterparty to burn tokens as much as possible.

    • Making a few pennies more from inference is not even on the radar of the labs making frontier models. The financial stakes are so much higher than that for them.

    • As long as there's no moat (and arguably current LLM inference APIs are far from having one), it arguably doesn't really matter what users pay by.

      The only thing I care about are whether the answer helps me out and how much I paid for it, whether it took the model a million tokens or one to get to it.

    • If I'll pay to get a fixed result, sure. I'd expect a Jevons paradox effect: if LLMs got me results twice as fast for the same cost, I'm going to use it more and end up paying more in total.

      Maximizing the utility of your product for users is usually the winning strategy.

  • Cursor Plan mode works like this. It restricts the LLM's access to your environment and will allow you to iteratively ask and clarify, and it’ll piece together a plan that it allows you to review before it takes any action.

    ChatGPT deep research does this but it’s weird and forced, because it asks one series of questions and then goes off to the races, spending a half hour or more building a report. It’s frustrating if you don’t know what to expect, and my wife got really mad the first time she wasted a deep research request asking it “can you answer multiple series of questions?” or some other functionality-clarifying question.

    I’ve found Cursor’s plan mode extremely useful, similar to having a conversation with a junior or offshore team member who is eager to get to work but not TOO eager. These tools are extremely useful; we just need to get the guard rails and user experience correct.

  • My favorite description of an LLM so far is of a typical 37-year-old male Reddit user. And in that sense, we have already created the AGI.

  • Lately, ChatGPT 5.1 has been less guilty of this and sometimes holds off answering fully and just asks me to clarify what I meant.

  • There are plenty of LLM services that have a conversational style. The paragraph blocks thing is just a style.

  • This is not necessarily a fundamental limitation. It's a consequence of a fine-tuning process where human raters decide how "good" an answer is. They're not rating the flow of the conversation, but looking at how complete / comprehensive the answer to a one-shot question looks. This selects for walls of overconfident text.

    Another thing the vendors are selecting for is safety / PR risk. If an LLM answers a hobby chemistry question in a matter-of-fact way, that's a disastrous PR headline in the making. If they open with several paragraphs of disclaimers or just refuse to answer, that's a win.

  • It's not a magic technology; they can only represent data they were trained on. Naturally, the most represented data in their training set is NOT conversational. Consider that such data is very limited, and who knows how it was labeled, if at all, during pretraining. With that in mind, LLMs definitely can do all the things you describe, but a very robust and well-tested system prompt has to be used to coax this behavior out. Also, a proper model has to be used, as some models are simply not trained for this type of interaction.

  • a) I find myself fairly regularly irritated by the flow of human-human conversations. In fact, it's more common than not. Of course, I have years of practice handling that more or less automatically, so it rarely rises to the level of annoyance, but it's definitely work I bring to most conversations. I don't know about you, but that's not really a courtesy I extend to the LLM.

    b) If it is, in fact, just one setting away, then I would say it's operating fairly similarly?

  • I didn't have the words to articulate some of my frustrations, but I think you summed it up nicely.

    For example, there have been many times when they take things too literally instead of looking at the totality of the context and what was written. I'm not an LLM, so I don't have a perfect grasp of every vocab term for every domain, and it feels especially pandering when they repeat back the wrong word but put it in quotes or bold instead of simply asking if I meant something else.

  • >LLMs don't do this

    They did at the beginning. It used to be that if you wanted a full answer with an intro, bullet points, lists of pros/cons, etc., you had to explicitly ask for it in the prompt. The answers were also a lot more influenced by the tone of the prompt instead of being forced into a specific format like they are right now.

  • That just means that you need to learn to adapt to the situation: Make your prompt a carefully crafted multi-paragraph description of every detail of the problem and what you want from the solution, with bullet points if appropriate.

    Maybe it feels a bit sad that you have to follow what the LLM wants, but that's just how any tool works really.

  • When I expect it to do that I just end my prompt with '. Discuss' - usually this works really well. Not exactly human-like - it tries to list all questions and variants at once - but most come with good default answers, so I only need to engage with a couple of them.

  • I like Manus's suggested follow-up questions.

    In fact, sometimes I screenshot them and use Mac's new built-in OCR to copy them, because Manus gives me three options but they disappear if I click one, and sometimes I really like 2 or even all 3.

  • By default they don't ask questions. You can craft that behaviour with the system message or account settings. Though they will tend to ask 20 questions at once, so you have to ask them to limit it to one question at a time to get a more natural experience.

  • The day when the LLM responds to my question with another question will be quite interesting. Especially at work, when someone asks me a question I need to ask for clarifying information to answer the original question fully.

    • Have you tried adding a system prompt asking for this behavior? They seem to readily oblige when I ask for this (e.g. brainstorming)

  • I suspect that's because they're trained on website content, and SEO values more text (see recipe websites). So the default response is fluff.

  • The benchmarks are dumb but highly followed so everyone optimizes for the wrong thing.

  • Reflect a moment over the fact that LLMs currently are just text generators.

    Also, the conversational behavior we see is just the model mimicking example conversations: when we say “System: you are a helpful assistant. User: let’s talk. Assistant:” it will complete the text in a way that mimics a conversation.

    Yeah, we improved over that using reinforcement learning to steer the text generation into paths that lead to problem solving and more “agentic” traces (“I need to open this file the user talked about to read it and then I should run bash grep over it to find the function the user cited”), but that’s just a clever way we found to let the model itself discover which text generation paths we like the most (or are more useful to us).

    So, to comment on your discomfort: we (humans) trained the model to spit out answers (there are thousands of human beings right now writing nicely thought-out and formatted answers to common questions so that we can train the models on them).

    If we try to train the models to mimic long dances into shared meaning we will probably decrease their utility. And we won't be able to do that anyway, because then we would have to have customized text traces for each individual instead of question-answer pairs.

    Downvoters: I simplified things a lot here, in name of understanding, so bear with me.

  • You just need to be more explicit. Including “ask clarifying questions” in your prompt makes a huge difference. Not sure if you use Claude Code but if you do, use plan mode for almost every task.

  • When using an LLM for anything serious (such as at work) I have a standard canned postscript along the lines of “if anything about what I am asking is unclear or ambiguous, or if you need more context to understand what I’m asking, you will ask for clarification rather than try to provide an answer”. This is usually highly effective.

  • Same experience. I try to learn with it, but I can't really tell if what it's teaching me is actually correct or merely made up when I challenge it with follow-up questions.

  • This drives me nuts when trying to bounce an architecture or coding solution idea off an LLM. A human would answer with something like "what if you split up the responsibility and had X service or Y whatever". No matter how many times you tell the LLM not to return code, it returns code. Like it can't think or reason about something without writing it out first.

    • > Like it can't think or reason about something without writing it out first.

      Setting aside the philosophical questions around "think" and "reason"... it can't.

      In my mind, as I write this, I think through various possibilities and ideas that never reach the keyboard, but yet stay within my awareness.

      For an LLM, that awareness and thinking through can only be done via its context window. It has to produce text that maintains what it thought about in order for that past to be something that it has moving forward.

      There are aspects to a prompt that can (in some interfaces) hide this internal thought process. For example, the ChatGPT has the "internal thinking" which can be shown - https://chatgpt.com/share/69278cef-8fc0-8011-8498-18ec077ede... - if you expand the first "thought for 32 seconds" bit it starts out with:

          I'm thinking the physics of gravity assists should be stable enough for me to skip browsing since it's not time-sensitive. However, the instructions say I must browse when in doubt. I’m not sure if I’m in doubt here, but since I can still provide an answer without needing updates, I’ll skip it.
      

      (aside: that still makes me chuckle - in a question about gravity assists around Jupiter, it notes that it's not time-sensitive... and the passage "I’m not sure if I’m in doubt here" is amusing)

      However, this is in the ChatGPT interface. If I'm using an interface that doesn't allow internal self-prompts / thoughts to be collapsed then such an interface would often be displaying code as part of its working through the problem.

      You'll also note a bit of the system prompt leaking in there - "the instructions say I must browse when in doubt". For an interface where code is the expected product, then there could be system prompts that also get in there that try to always produce code.

  • There are billions of humans. Not every one speaks the same way all the time. The default behavior is trying to be useful for most people.

    It's easy to skip and skim content you don't care about. It's hard to prod and prod to get it to say something you do care about if the machine is trained to be very concise.

    Complaining the AI can't read your mind is exceptionally high praise for the AI, frankly.

  • > When I ask a person something, I expect them to give me a short reply which includes another question/asks for details/clarification. A conversation is thus an ongoing "dance" where the questioner and answerer gradually arrive to the same shared meaning.

    You obviously never wasted countless hours trying to talk to other people on online dating apps.

  • In the US anyway, most adults read at a middle school level.

    It's not "masquerading as a human". The majority of humans are functional illiterates who only understand the world through the elementary principles of their local culture.

    It's the minority of the human species that take what amounts to little more than arguing semantics that need the reality check. Unless one is involved in work that directly impacts public safety (defined as harm to biology) the demand to apply one concept or another is arbitrary preference.

    Healthcare, infrastructure, and essential biological support services are all most humans care about. Everything else the majority see as academic wank.

We are trying to fix probability with more probability. That is a losing game.

Thanks for pointing out the elephant in the room with LLMs.

The basic design is non-deterministic. Trying to extract "facts" or "truth" or "accuracy" is an exercise in futility.

  • The factuality problem with LLMs isn't because they are non-deterministic or statistically based, but simply because they operate at the level of words, not facts. They are language models.

    You can't blame an LLM for getting the facts wrong, or hallucinating, when by design they don't even attempt to store facts in the first place. All they store are language statistics, boiling down to "with preceding context X, most statistically likely next words are A, B or C". The LLM wasn't designed to know or care that outputting "B" would represent a lie or hallucination, just that it's a statistically plausible potential next word.

    • I think this is why I get much more utility out of LLMs with writing code. Code can fail if the syntax is wrong; small perturbations in the text (e.g. add a newline instead of a semicolon) can lead to significant increases in the cost function.

      Of course, once an LLM is asked to create a bespoke software project for some complex system, this predictability goes away, the trajectory of the tokens succumbs to the intrinsic chaos of code over multi-block length scales, and the result feels more arbitrary and unsatisfying.

      I also think this is why the biggest evangelists for LLMs are programmers, while creative writers and journalists are much more dismissive. With human language, the length scale over which tokens can be predicted is much shorter. Even the "laws" of grammar can be twisted or ignored entirely. A writer picks a metaphor because of their individual reading/life experience, not because it's the most probable or popular metaphor. This is why LLM writing is so tedious, anodyne, sycophantic, and boring. It sounds like marketing copy because the attention model and RL-HF encourage it.

    • >but simply because they operate at the level of words, not facts. They are language models.

      Facts can be encoded as words. That's something we also do a lot for facts we learn, gather, and convey to other people. 99% of university is learning facts and theories and concepts from reading and listening to words.

      Also, even when directly observing the same fact, it can be interpreted by different people in different ways, whether this happens as raw "thought" or at the conscious verbal level. And that's before we even add value judgements to it.

      >All they store are language statistics, boiling down to "with preceding context X, most statistically likely next words are A, B or C".

      And how do we know we don't do something very similar with our facts - make a map of facts and concepts and weights between them for retrieving them and associating them? Even encoding in a similar way what we think of as our "analytic understanding".

      7 replies →

    • In a way, though, those things aren't as different as they might first appear. The factual answer is traditionally the most plausible response to many questions. They don't operate on any level other than pure language, but there are a heap of behaviours which emerge from that.

      8 replies →

    • Yeah, that’s very well put. They don’t store black-and-white; they store billions of grays. This is why tool use for research and grounding has been so transformative.

      1 reply →

    • > You can't blame an LLM for getting the facts wrong, or hallucinating, when by design they don't even attempt to store facts in the first place

      On one level I agree, but I do feel it’s also right to blame the LLM/company for that when the goal is to replace my search engine of choice (my major tool for finding facts and answering general questions), which is a huge pillar of how they’re sold to/used by the public.

      3 replies →

    • I think they are much smarter than that. Or will be soon.

      But they are like a smart student trying to get a good grade (that's how they are trained!). They'll agree with us even if they think we're stupid, because that gets them better grades, and grades are all they care about.

      Even if they are (or become) smart enough to know better, they don't care about you. They do what they were trained to do. They are becoming like a literal genie that has been told to tell us what we want to hear. And sometimes, we don't need to hear what we want to hear.

      "What an insightful price of code! Using that API is the perfect way to efficiently process data. You have really highlighted the key point."

      The problem is that chatbots are trained to do what we want, and most of us would rather have a syncophant who tells us we're right.

      The real danger with AI isn't that it doesn't get smart, it's that it gets smart enough to find the ultimate weakness in its training function - humanity.

      12 replies →

  • Determinism is not the issue. Synonyms exist, there are multiple ways to express the same message.

    When numeric models are fit to, say, scientific measurements, they do quite a good job at modeling the probability distribution. With a corpus of text we are not modeling truths but claims. The corpus contains contradicting claims. Humans have conflicting interests.

    Source-aware training (which can't be done as an afterthought LoRA tweak, but needs to be done during base model training AKA pretraining) could enable LLM's to express according to which sources what answers apply. It could provide a review of competing interpretations and opinions, and source every belief, instead of having to rely on tool use / search engines.

    None of the base model providers would do it at scale since it would reveal the corpus and result in attribution.

    In theory entities like the European Union could mandate that LLM's used for processing government data, or sensitive citizen / corporate data MUST be trained source-aware, which would improve the situation, also making the decisions and reasoning more traceable. This would also ease the discussions and arguments about copyright issues, since it is clear LLM's COULD BE MADE TO ATTRIBUTE THEIR SOURCES.

    I also think it would be undesirable to eliminate speculative output, it should just mark it explicitly:

    "ACCORDING to <source(s) A(,B,C,..)> this can be explained by ...., ACCORDING to <other school of thought source(s) D,(E,F,...)> it is better explained by ...., however I SUSPECT that ...., since ...."

    If it could explicitly separate the schools of thought sourced from the corpus, and also separate its own interpretations and mark them as LLM-speculated-suspicions, then we could still have the traceable references, without losing the potential novel insights LLM's may offer.

    • "chatGPT, please generate 800 words of absolute bullshit to muddy up this comments section which accurately identifies why LLM technology is completely and totally dead in the water."

      1 reply →

  • Bruce Schneier put it well:

    "Willison’s insight was that this isn’t just a filtering problem; it’s architectural. There is no privilege separation, and there is no separation between the data and control paths. The very mechanism that makes modern AI powerful - treating all inputs uniformly - is what makes it vulnerable. The security challenges we face today are structural consequences of using AI for everything."

    - https://www.schneier.com/crypto-gram/archives/2025/1115.html...

    • Attributing that to Simon when people have been writing articles about that for the last year and a half doesn't seem fair. Simon gave that view visibility, because he's got a pulpit.

      2 replies →

  • I couldn't agree with you more.

    I really do find it puzzling that so many on HN are convinced LLMs reason or think and continue to entertain this line of reasoning - while at the same time somehow knowing precisely what the brain/mind does, and constantly using CS language to provide correspondences where there are none. The simplest example being that LLMs somehow function in a similar fashion to human brains. They categorically do not. I do not have most all of human literary output in my head and yet I can coherently write this sentence.

    While I'm on the subject: LLMs don't hallucinate. They output text, and when that text is measured and judged by a human to be 'correct' then it is. LLMs 'hallucinate' because that is literally what they can ONLY do, provide some output given some input. They don't actually understand anything about what they output. It's just text.

    My paper and pen version of the latest LLM (quite a large bit of paper and certainly a lot of ink I might add) will do the same thing as the latest SOTA LLM. It's just an algorithm.

    I am surprised so many in the HN community have so quickly taken to assuming as fact that LLMs think or reason. Even anthropomorphising LLMs to this end.

    • Most of the things that were considered reasoning are now trivially implemented by computers - from arithmetic, through logical inference (surely this is reasoning, isn't it?), to playing chess. Now LLMs go even further - so what is your definition of reasoning? What concrete action is in that definition that you are sure a computer will not do in, let's say, 5 years?

      1 reply →

    • > The simplest example being that LLM's somehow function in a similar fashion to human brains. They categorically do not. I do not have most all of human literary output in my head and yet I can coherently write this sentence.

      The ratio of cognition to knowledge is much higher in humans than in LLMs. That is for sure. It is improving in LLMs, particularly small distillations of large models.

      A lot of where the discussion gets hung up is just words. I just used "knowledge" to mean the ability to recall and recite a wide range of facts, and "cognition" to mean the ability to generalize, notice novel patterns and execute algorithms.

      > They don't actually understand anything about what they output. It's just text.

      In the case of number multiplication, a bunch of papers have shown that the correct algorithm for the first and last digits of the number is embedded into the model weights. I think that counts as "understanding"; most humans I have talked to do not have that understanding of numbers.

      > It's just an algorithm.

      > I am surprised so many in the HN community have so quickly taken to assuming as fact that LLM's think or reason. Even anthropomorphising LLM's to this end.

      I don't think something being an algorithm means it can't reason, know or understand. I can come up with perfectly rigorous definitions of those words that wouldn't be objectionable to almost anyone from 2010, but would be passed by current LLMs.

      I have found anthropomorphizing LLMs to be a reasonably practical way to leverage the human skill of empathy to predict LLM performance. Treating them solely as text predictors doesn't offer any similar prediction; it is simply too complex to fit into a human mind. Paying a lot of attention to benchmarks, papers, and personal experimentation can give you enough data to make predictions from data, but it is limited to current models, is a lot of work, and isn't much more accurate than anthropomorphization.

      2 replies →

    • I have had conversations at work, with people who I have reason to believe are smart and critical, in which they made the claim that humans and AI basically learn in the same way. My response to them, as to anyone that makes this claim, is that the amount of data ingested by someone with severe sensory dysfunction of one sort or another is very small. Helen Keller is the obvious extreme example, but even a person who is simply blind is limited to the bandwidth of their hearing.

      And yet, nobody would argue that a blind person is any less intelligent than a sighted person. And so the amount of data a human ingests is not correlated with intelligence. Intelligence is something else.

      When LLMs were first proposed as useful tools for examining data and providing answers to questions, I wondered to myself how they would solve the problem of there being no a-priori knowledge of truth in the models. How they would find a way of sifting their terabytes of training data so that the models learnt only true things.

      Imagine my surprise that not only did they not attempt to do this, but most people did not appear to understand that this was a fundamental and unsolvable problem at the heart of every LLM that exists anywhere. That LLMs, without this knowledge, are just random answer generators. Many, many years ago I wrote a fun little Markov-chain generator I called "Talkback", that you could feed a short story to and then have a chat with. It enjoyed brief popularity at the University I attended, you could ask it questions and it would sort-of answer. Nobody, least of all myself, imagined that the essential unachievable idea - "feed in enough text and it'll become human" - would actually be a real idea in real people's heads.

      This part of your answer, though:

      "My paper and pen version of the latest LLM ..."

      Is just a variation of the Chinese Room argument, and I don't think it holds water by itself. It's not that it's just an algorithm, it's that learning anything usefully correct from the entire corpus of human literary output by itself is fundamentally impossible.

      1 reply →

    • People believe that because they are financially invested in it. Everyone has known LLMs are bullshit for years now.

  • You could make an LLM deterministic if you really wanted to without a big loss in performance (fix random seeds, make MoE batching deterministic). That would not fix hallucinations.

    I don't think using deterministic / stochastic as a diagnostic is accurate here - I think what we're really talking about is some sort of fundamental 'instability' of LLMs a la chaos theory.

    • Hallucinations can never be fixed. LLMs 'hallucinate' because that is literally what they can ONLY do: provide some output given some input. The output is measured and judged by a human who then classifies it as 'correct' or 'incorrect'. In the latter case it seems to be labelled a 'hallucination', as if it did something wrong. It did nothing wrong and worked exactly as it was programmed to do.

    • We talk about "probability" here because the topic is hallucination, not getting different answers each time you ask the same question. Maybe you could make the output deterministic, but that does not help with the hallucination problem at all.

      1 reply →

  • > The basic design is non-deterministic

    Is it? I thought an LLM was deterministic provided you run the exact same query on exact same hardware at a temperature of 0.

    • Not quite, even then, since a lot is typically executed in parallel and the implementation details of most number representations make them sensitive to the order of operations.

      Given how much number crunching is at the heart of LLMs, these small differences add up.
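
      A tiny illustration of that order-of-operations sensitivity, with nothing but plain Python floats:

          # Floating-point addition is not associative, so the order in which
          # parallel partial sums get combined can change the result.
          a, b, c = 0.1, 0.2, 0.3
          print((a + b) + c)              # 0.6000000000000001
          print(a + (b + c))              # 0.6

          print(sum([1e16, 1.0, -1e16]))  # 0.0  (the 1.0 is absorbed by 1e16 first)
          print(sum([1e16, -1e16, 1.0]))  # 1.0  (same numbers, different order)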

    • My understanding is that it selects from a probability distribution. Raising the temperature merely flattens that distribution, Boltzmann factor style

  • >The basic design is non-deterministic. Trying to extract "facts" or "truth" or "accuracy" is an exercise in futility

    We ourselves are non-deterministic. We're hardly ever in the same state, can't rollback to prior states, and we hardly ever give the same exact answer when asked the same exact question (and if we include non-verbal communication, never).

  • This very repo is just to "fix probability with more probability."

    > The next time the agent runs, that rule is injected into its context. It essentially allows me to “Patch” the model’s behavior without rewriting my prompt templates or redeploying code.

    What a brainrot idea... the whole post being written by LLM is the icing on the cake.

  • The author's solution feels like adding even more probability to their solution.

    > The next time the agent runs, that rule is injected into its context.

    Which the agent may or may not choose to ignore.

    Any LLM rule must be embedded in an API. Anything else is just asking for bugs or security holes.

  • Isn't that true of everything else also? Facts about real things are the result of sampling reality several times and coming up with consistent stories about those things. The accuracy of those stories is always bounded by probabilities related to how complete your sampling strategy is.

  • Hard drives and network pipes are non-deterministic too, we use error correction to deal with that problem.

  • Exactly. We treat them like databases, but they are hallucination machines.

    My thesis isn't that we can stop the hallucinating (non-determinism), but that we can bound it.

    If we wrap the generation in hard assertions (e.g., assert response.price > 0), we turn 'probability' into 'manageable software engineering.' The generation remains probabilistic, but the acceptance criteria becomes binary and deterministic.
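
    A minimal sketch of that pattern (the field names, checks, and retry policy here are hypothetical, just to make the idea concrete):

        import json

        def generate_quote(llm_call, prompt, max_attempts=3):
            """Accept a model response only if it passes deterministic checks."""
            for _ in range(max_attempts):
                raw = llm_call(prompt)              # probabilistic step
                try:
                    quote = json.loads(raw)         # binary acceptance criteria below
                    assert isinstance(quote.get("price"), (int, float))
                    assert quote["price"] > 0
                    assert quote.get("currency") in {"USD", "EUR", "GBP"}
                    return quote
                except (json.JSONDecodeError, AssertionError):
                    continue                        # regenerate instead of passing garbage on
            raise ValueError(f"no acceptable response after {max_attempts} attempts")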

      > but the acceptance criteria becomes binary and deterministic.

      Unfortunately, the use-case for AI is often where the acceptance criteria is not easily defined --- a matter of judgment. For example, "Does this patient have cancer?".

      In cases where the criteria can be easily and clearly stipulated, AI often isn't really required.

      4 replies →

    • I don't agree that users see them as databases. Sure, there are those who expect LLMs to be infallible and punish the technology when it disappoints them, but it seems to me that the overwhelming majority quickly learn what AI's shortcomings are, and treat them instead like intelligent entities who will sometimes make mistakes.

      5 replies →

    • > We treat them like databases, but they are hallucination machines.

      Which is kind of crazy because we don't even treat people as databases. Or at least we shouldn't.

      Maybe it's one of those things that will disappear from culture one funeral at a time.

      1 reply →

  • I find it amusing that once you try to take LLMs and do productive work with them either this problem trips you up constantly OR the LLM ends up becoming a shallow UI over an existing app (not necessarily better, just different).

    • The UI of the Internet (search) has recently gotten quite bad. In this light it is pretty obvious why Google is working heavily on these models.

      I fully expect local models to eat up most other LLM applications—there’s no reason for your chat buddy or timer setter to reach out to the internet, but LLMs are pretty good at vibes-based search, and that will always require looking at a bunch of websites, so it should slot exactly into the gap left by search engines becoming unusable.

      2 replies →

  • This is exactly why I don't like dealing with most people.

    • Every thread like this I like to go through and count how many people are making the pro-AI "Argument from Misanthropy." Based on this exercise, I believe that the biggest AI boosters are simply the most disagreeable people in the industry, temperamentally speaking.

      9 replies →

  • lol humans are non-deterministic too

    • But we also have a stake in our society, in the form of a reputation or accountability, that greatly influences our behaviour. So comparing us to an LLM has always been meaningless anyway.

      4 replies →

    • Which is why every tool that is better than humans at a certain task is deterministic.

    • Yeah, but not when they are expected to perform in a job role. Too much nondeterminism in that case leads to firing and replacing the human with a more deterministic one.

      2 replies →

- Claude, please optimise the project for performance.

o Claude goes away for 15 minutes, doesn't profile anything, many code changes.

o Announces project now performs much better, saving 70% CPU.

- Claude, test the performance.

o Performance is 1% _slower_ than previous.

- Claude, can I have a refund for the $15 you just wasted?

o [Claude waffles], "no".

  • I’ve always found the hard numbers on performance improvement hilarious. It’s just mimicking what people say on the internet when they get performance gains

    • > It’s just mimicking what people say on the internet when they get performance gains

      probably read a bunch of junior/mid-level resumes saying they optimized 90% of the company by 80%

  • If you provide it a benchmark script (or ask it to write one) so it has concrete numbers to go off of, it will do a better job.

    I'm not saying these things don't hallucinate constantly, they do. But you can steer them toward better output by giving them better input.
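
    Even a trivial harness like this (the workload here is a made-up stand-in) gives it a real number to optimize against instead of one it invents:

        import statistics
        import time

        def workload():
            # stand-in for the hot path you actually care about
            return sum(i * i for i in range(100_000))

        def bench(fn, repeats=20):
            samples = []
            for _ in range(repeats):
                start = time.perf_counter()
                fn()
                samples.append(time.perf_counter() - start)
            return statistics.median(samples)

        print(f"median runtime: {bench(workload) * 1000:.2f} ms")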

  • While you’re making unstructured requests and expecting results, why don’t you ask your barista to make you a “better coffee” with no instructions. Then, when they make a coffee with their own brand of creativity, complain that it tastes worse and you want your money back.

    • Both "better coffee" and "faster code" are measurable targets. Somewhat vaguely defined, but nobody is stopping the Barista or Claude from asking clarifying questions.

      If I gave a human this task I would expect them to transform the vague goal into measurable metrics, confirm that the metrics match customer (==my) expectations then measure their improvements on these metrics.

      This kind of stuff is a major topic for MBAs, but it's really not beyond what you could expect from a programmer or a barista. If I ask you for a better coffee, what you deliver should be better on some metric you can name, otherwise it's simply not better. Bonus points if it's better in a way I care about

      1 reply →

    • I was experimenting with Claude Code and requested something more CPU-efficient in a very small project; there were a few avenues to explore, and I was interested to see what path it would take. It turned out that it seized upon something which wasn't consuming much CPU anyway and was difficult to optimise further. I learned that I'd have to be more explicit in future and direct an analysis phase, and probably kick in a few strategies for performance optimisation which it could then explore. The refund request was an amusement. It was $15 well spent on my own learning.

      1 reply →

    • I could also argue that if a barista gets multiple complaints about their coffee it's very much their and their employer's job to go away and figure out how to make good coffee.

      It's very much not the customer's job to learn about coffee and to direct them how to make a quality basic coffee.

      And it's not rocket science.

    • "Optimize this code for performance" is not an unstructured or vague request.

      Any "performance" axis could have been used: Number of db hits, memory pressure, cpu usage, whatever.

      The LLM chose (or whatever) to use CPU performance, claimed a specific figure, and that figure was demonstrably not real.

      If you ask a barista to make you a better coffee, and the barista says "this coffee is hotter" and it just isn't, the problem is not underspecified requirements, the problem is that it just doesn't make any attempt to say things that are only correct. Technically it can't make any attempt.

      If I tell an intern "Optimize this app for performance" and they come back having reduced the memory footprint by half, but that didn't actually matter because the app was never memory constrained, I could hem and haw about not giving clear instructions, but I could also use that as a teachable moment to help the budding engineer learn how to figure out what matters when given that kind of leeway, to still have impact.

      If they instead come back and say "I cut memory usage in half" and then you have them run the app and it has the exact same memory usage, you don't think about not giving clear enough instructions, because you should be asking the intern "Why are you lying to my face?" and "Why are you confidently telling me something you did not verify?".

      1 reply →

  • The last bit, in my limited experience:

    > Claude: sorry, you have to wait until XX:00 as you have run out of credit.

  • If you really want to do this, you should probably ask for a plan first and review it.

  • I can't help but notice that your first two bullets match rather closely the behavior of countless pre-AI university students assigned a project.

OP here. I wrote this because I got tired of agents confidently guessing answers when they should have asked for clarification (e.g. guessing "Springfield, IL" instead of asking "Which state?" when asked "weather in Springfield").

I built an open-source library to enforce these logic/safety rules outside the model loop: https://github.com/imtt-dev/steer
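
To be clear about the general pattern (this is a generic sketch of the idea, not the library's actual API): the rule lives in ordinary deterministic code that gates the model's draft before it reaches the user.

    import re

    # Example rule: "Springfield" without a state abbreviation is ambiguous.
    AMBIGUOUS_SPRINGFIELD = re.compile(r"\bSpringfield\b(?!,\s*[A-Z]{2}\b)")

    def gate(user_msg: str, draft_answer: str) -> str:
        """Deterministic check applied to the model's draft, outside the model loop."""
        if AMBIGUOUS_SPRINGFIELD.search(user_msg) and "which" not in draft_answer.lower():
            # The rule fires: return a clarification instead of a confident guess.
            return "Which Springfield do you mean (e.g. IL, MA, MO)?"
        return draft_answer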

  • This approach kind of reminds me of taking an open-book test. Performing mandatory verification against a ground truth is like taking the test, then going back to your answers and looking up whether they match.

    Unlike a student, the LLM never arrives at a sort of epistemic coherence, where they know what they know, how they know it, and how true it's likely to be. So you have to structure every problem into a format where the response can be evaluated against an external source of truth.

  • Thanks a lot for this. Also one question, in case anyone could shed a bit of light: my understanding is that setting temperature=0, top_p=1 would cause deterministic output (identical output given identical input). For sure it won’t prevent factually wrong replies/hallucination; it only maintains generation consistency (e.g. classification tasks). Is this universally correct, or is it dependent on the model used? (Or a downright wrong understanding, of course?)

    • > my understanding is that setting temperature=0, top_p=1 would cause deterministic output (identical output given identical input).

      That's typically correct. Many models are implemented this way deliberately. I believe it's true of most or all of the major models.

      > Is this universally correct or is it dependent on model used?

      There are implementation details that lead to uncontrollable non-determinism if they're not prevented within the model implementation. See e.g. the Pytorch docs for CUDA convolution determinism: https://docs.pytorch.org/docs/stable/notes/randomness.html#c...

      That documents settings like this:

          torch.backends.cudnn.deterministic = True 
      

      Parallelism can be a source of non-determinism if it's not controlled for, either implicitly via e.g. dependencies or explicitly.
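
      For what it's worth, the knobs look something like this at the API level (OpenAI-style client, placeholder model name); whether you get bit-exact repeats still depends on the provider's backend, for the reasons above:

          from openai import OpenAI

          client = OpenAI()
          resp = client.chat.completions.create(
              model="gpt-4o-mini",   # placeholder model name
              temperature=0,         # always pick the most likely token
              seed=1234,             # best-effort reproducibility, not a hard guarantee
              messages=[{"role": "user", "content": "Classify: 'refund not received'"}],
          )
          print(resp.choices[0].message.content)
          print(resp.system_fingerprint)  # if this changes, repeats may differ anyway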

  • You should use structured output rather than checking and rechecking for valid json. It can’t solve all of your problems but it can enforce a schema on the output format.
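
    For example, a client-side sketch with Pydantic (the schema fields here are made up); many providers also accept a JSON schema directly so the decoder can't emit anything that fails to parse:

        from pydantic import BaseModel, ValidationError, field_validator

        class Quote(BaseModel):
            price: float
            currency: str

            @field_validator("price")
            @classmethod
            def price_positive(cls, v: float) -> float:
                if v <= 0:
                    raise ValueError("price must be positive")
                return v

        raw = '{"price": 19.99, "currency": "USD"}'   # stand-in for model output
        try:
            quote = Quote.model_validate_json(raw)
        except ValidationError as err:
            print(f"rejecting model output: {err}")   # regenerate rather than trust it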

LLMs are text models, not world models, and that is the root cause of the problem. If you and I were discussing furniture and for some reason you had assumed the furniture to be glued to the ceiling instead of standing on the floor (contrived example), then it would most likely take only one correction, grounded in your actual experience, to realize you were probably on the wrong track. An LLM will happily re-introduce that error a few ping-pongs later and re-establish the track it was on before, because that apparently is some kind of attractor.

Not having a world model is a massive disadvantage when dealing with facts. Facts are supposed to reinforce each other; if you allow even a single fact that is nonsense then you can very confidently deviate into what at best would be misguided science fiction, and at worst is going to end up being used as a basis on which to build an edifice that simply has no support.

Facts are contagious: they work just like foundation stones, if you allow incorrect facts to become a part of your foundation you will be producing nonsense. This is my main gripe with AI and it is - funny enough - also my main gripe with some mass human activities.

  • >LLMs are text models, not world models, and that is the root cause of the problem.

    Is it though? In the end, the information in the training texts is a distilled proxy for the world, and the weighted model ends up being a world model, just a once-removed one.

    Text is not that different to visual information in that regard (and humans base their world model on both).

    >Not having a world model is a massive disadvantage when dealing with facts, the facts are supposed to re-inforce each other, if you allow even a single fact that is nonsense then you can very confidently deviate into what at best would be misguided science fiction, and at worst is going to end up being used as a basis to build an edifice on that simply has no support.

    Regular humans believe all kinds of facts that are nonsense, many others that are wrong, and quite a few that are even counter to logic too.

    And short of omnipresence and omniscience, directly examining the whole world, any world model (human or AI) is built on sets of facts, many of which might not be true or valid to begin with.

    • I really think it is; this is the exact same thing that keeps going wrong in these conversations over and over again. There simply is no common sense, none at all, just a likelihood of applicability. To the point that I even wonder how it is possible to get such basic stuff, for which there is an insane amount of support, wrong.

      I've had an hour-long session that essentially revolved around why the landing gear of an aircraft is at the bottom, not the top, of the vehicle (paraphrased for good reasons, but it was really that basic). And this happened not just once, but multiple times. Confident declarations followed by absolute nonsense. I've even had - I think it was ChatGPT - try to gaslight me with something along the lines of 'you yourself said' about something that I did not say (this is probably the most person-like thing I've seen it do).

      1 reply →

    • People have an actual world model, though, that they have to deal with in order to get the food into their mouths or to hit the toilet properly.

      The "facts" that they believe that may be nonsense are part of an abstract world model that is far from their experience, for which they never get proper feedback (such as the political situation in Bhutan, or how their best friend is feeling.) In those, it isn't surprising that they perform like an LLM, because they're extracting all of the information from language that they've ingested.

      Interestingly, the feedback that people use to adjust the language-extracted portions of their world models is whether demonstrating their understanding of those models pleases or displeases the people around them, who in turn respond in physically confirmable ways. What irritates people about simpering LLMs is that they're not doing this properly. They should be testing their knowledge with us (especially their knowledge of our intentions or goals), and have some fear of failure. They have no fear and take no risk; they're stateless and empty.

      Human abstractions are based in the reality of the physical responses of the people around them. The facts of those responses are true and valid results of the articulation of these abstractions. The content is irrelevant; when there's no opportunity to act, we're just acting as carriers.

      2 replies →

    • >In the end, the information in the training texts is a distilled proxy for the world

      This is routinely asserted. How has it been proven?

      Humans write all sorts of text that has zero connection to reality, even when they are ostensibly writing about reality.

      Training on ancient Greek philosophy, which was expressly written to distill knowledge about the real world, would produce a stupid LLM that doesn't know about the real world, because the training text was itself wrong about the underlying world.

      Also, if LLMs were able to extract underlying truth from training material, why can't they do math very well? It would be easy to train an LLM on only correct math; indeed, you could generate a corpus of provably correct math of any size you want. I assume someone somewhere has demonstrated success training a neural network on math and having it regenerate something like "addition" or whatever, but how well would such a process survive if a large fraction of its training material were instead just incorrect math?

      The training text is nothing more than human generated text, and asserting anything about that more concrete than "Humans consider this text good enough to be worth writing" is fallacious.

      This even applies if your training corpus is, for example, only physics scientific papers that have been strongly replicated and are likely "true". Unless the LLM is also trained on the data itself, the only information available is what the humans thought and wrote. There's no definite link between that and actual reality, which is why physics accepted an "Aether" for so long. The data we had up to that point aligned with our incorrect models. You could not disambiguate between the wrong Aetheric models and a better model with the data we had, and that would remain true of text written about the data.

      Humans suck at distilling fact out of reality despite our direct connection to it for all sorts of fun reasons you can read about in psychology, but if you disconnect a human from reality, it only gets worse.

      Why would you believe LLMs could possibly be different? A model trained on bad data cannot magically figure out which data is bad.

      1 reply →

  • The "world model" is what we often refer to as the "context". But it is hard to anticipate bad assumptions that seem obvious because of our existing world model. One of the first bugs I scanned past from LLM generated code was something like:

    if user.id == "id": ...

    Not anticipating that it would arbitrarily put quotes around a variable name. Other times it will do all kinds of smart logic, generate data with ids, and then fail to use those ids for lookups, or something equally obvious.

    The problem is LLMs guess so much correctly that it is near impossible to understand how or why they might go wrong. We can solve this with heavy validation, iterative testing, etc. But the guardrails we need to actually make the results bulletproof need to go far beyond normal testing. LLMs can make such fundamental mistakes while easily completing complex tasks that we need to reset our expectations for what "idiot proofing" really looks like.
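
    A tiny example of the kind of deterministic check that can catch that first class of bug (an attribute compared against its own name as a string literal); it covers only a narrow slice of the idiot proofing needed, and the helper name here is made up:

        import ast

        def flag_name_vs_literal(source: str) -> list[int]:
            """Flag comparisons like `user.id == "id"` (attribute vs. its own name)."""
            hits = []
            for node in ast.walk(ast.parse(source)):
                if isinstance(node, ast.Compare) and isinstance(node.left, ast.Attribute):
                    for comp in node.comparators:
                        if isinstance(comp, ast.Constant) and comp.value == node.left.attr:
                            hits.append(node.lineno)
            return hits

        print(flag_name_vs_literal('if user.id == "id":\n    pass'))  # -> [1]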

    • > The "world model" is what we often refer to as the "context".

      No, we often do not, and when we do that's just plain wrong.

Confident idiot: I’m exploring using an LLM for diagram creation.

I’ve found after about 3 prompts to edit an image with Gemini, it will respond randomly with an entirely new image. Another quirk is it will respond “here’s the image with those edits” with no edits made. It’s like a toaster that will catch on fire every eighth or ninth time.

I am not sure how to mitigate this behavior. I think maybe an LLM-as-a-judge step with vision to evaluate the output before passing it on to the poor user.

  • I had a similar result trying to create 16 similarly styled images. After half a dozen it just started kicking out the same image over and over again no matter what the prompt said. Even the “thinking” looked right, but the image was just a repeat. I don’t know if this is some type of context limitation or what.

    I got around it by using a new prompt/context for each image. This required some rethinking about how to make them match. What I did was create a sprite sheet with the first prompt and then only replaced (edited) the second prompt.

    I still got some consistency problems because there were a few important details left out of my sprite sheet. Next time I think I’ll create those individually and then attach them as context for additional prompts.

    • Oh smart. This is good guidance. Yeah fascinating how longer running context causes these side effects, especially the repeated image with no changes bug.

  • What are your thoughts on the diagram-as-code movement? I'd prefer to have an LLM utilize those, as it can at least drive some determinism through it, rather than deal with the slippery layer that is prompt control for visual LLMs.

    • I think that's the right approach and what I've been experimenting with. Diagram as code and then style transfer from output diagram to desired look. That's where I've had the most success.
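
      The nice part is that the artifact is plain text, so you can check it deterministically before it ever gets rendered. A rough sketch (Mermaid here, but Graphviz DOT works the same way; the required-node list is just an illustration):

          REQUIRED_NODES = {"Client", "API", "DB"}

          def missing_nodes(mermaid_src: str) -> set[str]:
              """Return the required nodes that do not appear in the diagram source."""
              return {n for n in REQUIRED_NODES if n not in mermaid_src}

          diagram = """
          flowchart LR
              Client --> API
              API --> DB
          """
          print(missing_nodes(diagram))  # -> set(), i.e. nothing is missing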

  • Have you considered that perhaps such things simply are not within its capabilities?

    • I mean, one of its flagship features is to make precise edits to images. And it's really good at it... until it randomly isn't.

  • Yes, same here.

    I don't know if it's a fault with the model or just a bug in the Gemini app.

  • Same. I gave it a very well hand-drawn floor plan, but it never seems to be able to create a formal version of it. It's very, very simple too.

    It makes hilarious mistakes like putting the toilet right in the middle of the living room.

    I don't get all the hype. Am I stupid?

This comment will probably get buried because I’m late to the party, but I’d like to point out that while they identify a real problem, the author’s approach—using code or ASTs to validate LLM output—does not solve it.

Yes, the approach can certainly detect (some) LLM errors, but it does not provide a feasible method to generate responses that don’t have the errors. You can see at the end that the proposed solution is to automatically update the prompt with a new rule, which is precisely the kind of “vibe check” that LLMs frequently ignore. If they didn’t, you could just write a prompt that says “don’t make any mistakes” and be done with it.

You can certainly use this approach to do some RL on LLM code output, but it’s not going to guarantee correctness. The core problem is that LLMs do next-token prediction and it’s extremely challenging to enforce complex rules like “generate valid code” a priori.
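
To be concrete: a post-hoc check like the sketch below can reject syntactically invalid output and trigger a retry, but the retry is still just sampling from the same distribution, so detection is cheap while generation guarantees are not.

    import ast

    def is_valid_python(source: str) -> bool:
        """Detection is easy and deterministic."""
        try:
            ast.parse(source)
            return True
        except SyntaxError:
            return False

    # ...but the "fix" is another roll of the dice, e.g. (pseudo-usage):
    # while not is_valid_python(output := llm_generate(prompt)):
    #     prompt += "\nReturn valid Python only."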

As a closing comment, it seems like I’m seeing a lot of half-baked technical stuff related to LLMs these days, because LLMs are good at supporting people when they have half-baked ideas, and are reluctant to openly point out the obvious flaws.

I had been working on NLP, mostly NLU, for some years before LLMs. I tried the Universal Sentence Encoder alongside many ML "techniques" in order to understand user intentions and extract entities from text.

The first time I tried ChatGPT, that was the thing that surprised me most: the way it understood my queries.

I think the spotlight is on the "generative" side of this technology, and we're not giving the query understanding the credit it deserves. I'm also not sure we're fully taking advantage of this functionality.

  • Yes, I was (and still am) similarly impressed with LLMs ability to understand the intent of my queries and requests.

    I've tried several times to understand the "multi-head attention" mechanism that powers this understanding, but I'm yet to build a deep intuition.

    Is there any research, or are there expository papers, that talk about this "understanding" aspect specifically? How could we measure understanding without generation? Are there benchmarks out there specifically designed to test deep/nuanced understanding skills?

    Any pointers or recommended reading would be much appreciated.

We already have verification layers: high-level, strictly typed languages like Haskell, OCaml, ReScript/Melange (JS ecosystem), PureScript (JS), Elm, Gleam (Erlang), and F# (for the .NET ecosystem).

These aren’t just strict type systems: the languages also offer algebraic data types, nominal types, etc., which allow for encoding higher-level types enforced by the compiler.

The AI essentially becomes a glorified blank filler filling in the blanks. Basic syntax errors or type errors, while common, are automatically caught by the compiler as part of the vibe coding feedback loop.

  • Interestingly, coding models often struggle with complex type systems, e.g. in Haskell or Rust. Of course, part of this has to do with the relative paucity of relevant training data, but there are also "cognitive" factors that mirror what humans tend to struggle with in those languages.

    One big factor behind this is the fact that you're no longer just writing programs and debugging them incrementally, iteratively dealing with simple concrete errors. Instead, you're writing non-trivial proofs about all possible runs of the program. There are obviously benefits to the outcome of this, but the process is more challenging.

    • Actually, I found the coding models to work really well with these languages. And the type systems are not actually complex. OCaml's type system is really simple, which is probably why the compiler can be so fast. Even back in the "beta" days of Copilot, despite it being marketed as Python-only, I found it worked for OCaml syntax just as well.

      The coding models work really well with esoteric syntaxes, so if the biggest hurdle to adoption of Haskell was syntax, that's definitely less of a hurdle now.

      > Instead, you're writing non-trivial proofs about all possible runs of the program.

      All possible runs of a program are exactly what HM type systems check for. Fed this feedback, the coding model automatically iterates until it finds a solution that doesn't violate any possible run of the program.

      1 reply →

A basic rule of MLE is to have guardrails on your model output; you don't want some high-leverage training data point to trigger problems in prod. These guardrails should be deterministic, separate from the inference system, and basically a stack of user-defined policies. LLMs are ultimately just interpolated surfaces, and the rules are the same as if it were LOESS.
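
In practice that stack can be as simple as a list of pure functions run over every output before it leaves the system; a minimal sketch (the two policies here are placeholders):

    from typing import Callable, Optional

    Policy = Callable[[str], Optional[str]]   # returns an error message, or None if OK

    def no_empty(out: str) -> Optional[str]:
        return "empty output" if not out.strip() else None

    def max_length(out: str) -> Optional[str]:
        return "output too long" if len(out) > 4000 else None

    POLICIES: list[Policy] = [no_empty, max_length]

    def apply_guardrails(output: str) -> list[str]:
        """Run every deterministic policy; anything returned is a violation."""
        return [err for policy in POLICIES if (err := policy(output)) is not None]

    print(apply_guardrails("fine answer"))  # -> []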

Yeah, I’ve found that the only way to let AI build any larger amount of useful code and data for a user who does not review all of it is with a lot of “gutter rails”. Not just adding more prompting, because that is an after-the-fact solution. Not just verifying and failing a turn, because that adds latency and allows the model to start spinning out of control. It also takes isolating tasks and autofixing output to keep the model on track.

Models definitely need less and less of this with each version that comes out, but it’s still what you need to do today if you want to be able to trust the output. And even in a future where models approach perfection, I think this approach will be the way to reduce latency and keep tabs on whether your prompts are producing the output you expected at larger scale. You will also be building good evaluation data for testing alternative approaches, or even for fine-tuning.

>We are trying to fix probability with more probability. That is a losing game.

>We need to re-introduce Determinism into the stack.

>If it fails lets inject more prompts but call it "rules" and run the magic box again

Bravo.

Aren't we just reinventing programming languages from the ground up?

This is the loop (and honestly, I predicted it way before it started):

1) LLMs can generate code from "natural language" prompts!

2) Oh wait, I actually need to improve my prompt to get LLMs to follow my instructions...

3) Oh wait, no matter how good my prompt is, I need an agent (aka a for loop) that goes through a list of deterministic steps so that it actually follows my instructions...

4) Oh wait, now I need to add deterministic checks (aka, the code that I was actually trying to avoid writing in step 1) so that the LLM follows my instructions...

5) <some time in the future>: I came up with this precise set of keywords that I can feed to the LLM so that it produces the code that I need. Wait a second... I just turned the LLM into a compiler.

The error is believing that "coding" is just accidental complexity. "You don't need a precise specification of the behavior of the computer": this is the assumption that would make LLM agents actually viable. And I cannot believe that there are software engineers who think that coding is accidental complexity. I understand why PMs, CEOs, and other fun people believe this.

Side note: I am not arguing that LLMs/coding agents aren't nice. T9 was nice, autocomplete is nice, LLMs are very nice! But I am starting to get a bit too fed up with everyone believing that you can get rid of coding.

  • The hard part is just learning interfaces quickly for programming. If only we had a good tool for that.

I dunno man, if you see response code 404 and start looking into network errors, you need to read up on HTTP response codes. There is no way a network error results in a 404; a 404 means the request reached a server and the server answered.

Can someone please explain why these token guessing models aren't being combined with logic "filters?"

I remember when computers were lauded for being precise tools.

  • 1. Because no one knows how to do it. 2. Consider (a) a tool that can apply precise methods when they exist, and (b) a tool that can do that and can also imperfectly solve problems that lack precise solutions. Which is more powerful?

  • IntelliJ knows my Frob class does not have a static Blurb method, yet will still allow an LLM to generate a code completion of "frob.blurb()"

    It's insanity. This one stupid issue has cost me significant productivity. I got so much benefit from being able to hit "Tab" every few lines, but now I instead have to press whatever button combos or interactions cause the suggestion to go away, and then type what would have been suggested previously.

    We had really good code completion that never made this kind of mistake for 20 years. Apparently we are going to throw that all away because """AI"""?

    Just utter fucking insanity.

This is why TDD is how you want to do AI dev. The more tests and test gates, the better. Include profiling in your standard run. Add telemetry like it’s going out of fashion. Teach it how to use the tools in AGENTS.md. And watch the output. Tests. Observability. Gates. Have a non-negotiable connection with reality.

The problem with these agent loops is that their text output is manipulated to then be fed back in as text input, to try and get a reasoning loop that looks something like "thinking".

But our human brains do not work like that. You don't reason via your inner monologue (indeed, there are fully functional people with barely any inner monologue); your inner monologue is a projection of thoughts you've already had.

And unfortunately, we have no choice but to use the text input and output of these layers to build agent loops, because trying to build it any other way would be totally incomprehensible (because the meaning of the outputs of middle layers are a mystery). So the only option is an agent which is concerned with self-persuasion (talking to itself).

The proposed solution only works for answers where objective validation is easy. That's a start, but it's not going to make a big dent in the hallucination problem.

"Don’t ask an LLM if a URL is valid. It will hallucinate a 200 OK. Run requests.get()."

Except for sites that block any user agent associated with an AI company.
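
The check can still be deterministic, though: send a realistic request and treat blocks or network failures as "unknown" rather than "invalid". A rough sketch (the User-Agent string here is just an example):

    import requests

    def url_looks_reachable(url: str) -> bool | None:
        """True/False when we got an HTTP answer, None when the check itself failed."""
        headers = {"User-Agent": "Mozilla/5.0 (compatible; link-checker)"}
        try:
            resp = requests.get(url, headers=headers, timeout=10, allow_redirects=True)
        except requests.RequestException:
            return None          # blocked, DNS failure, timeout: not proof either way
        return resp.status_code < 400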

I think this is for the best. Let the "confident idiot" types briefly entertain the idea of competency, hit the inevitable wall, and go away for good. It will take a few years, lots of mistakes, and billions (if not trillions) wasted, but those people will drift back to the mean or lower when they realize ChatGPT isn't the ghost of Robin Leach.

What I do is actually run the task. If it is a script, get the logs. If it is a website, get screenshots. Otherwise it is coding blind.

It's like writing a script with the attitude "yeah, I am good at it, I don't need to actually run it to know it works" - well, likely, it won't work. Maybe because of a trivial mistake.

it's actually just trust but verify type stuff:

- verifying isn't asking "is it correct?"; verifying is "run requests.get, does it return blah or no?"

just like with humans but usually for different reasons and with slightly different types of failures.

The interesting part, perhaps, is that verifying pretty much always involves code, and code is great pre-compacted context for humans and machines alike. Ever tried to get an LLM to do a visual thing? Why is the couch in the wrong spot with a weird color?

If you make the LLM write a program that generates the image (e.g. a game-engine picture, or a 3D render), you can enforce the rules with code it can also write for you - now the couch color uses a hex code and it's placed at the right coordinates, every time.
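
Concretely, the "rules" end up as boring assertions over the scene data rather than prompt pleading; a toy sketch (the color, room bounds, and scene format are made up for illustration):

    import re

    COUCH_COLOR = "#8b5a2b"
    ROOM = (0, 0, 500, 400)          # x_min, y_min, x_max, y_max

    def validate_scene(scene: dict) -> list[str]:
        """Deterministic checks on the generated scene spec, before rendering."""
        errors = []
        couch = scene["couch"]
        if not re.fullmatch(r"#[0-9a-fA-F]{6}", couch["color"]):
            errors.append("couch color is not a hex code")
        elif couch["color"].lower() != COUCH_COLOR:
            errors.append("couch color drifted from the spec")
        x, y = couch["position"]
        if not (ROOM[0] <= x <= ROOM[2] and ROOM[1] <= y <= ROOM[3]):
            errors.append("couch placed outside the room")
        return errors

    print(validate_scene({"couch": {"color": "#8B5A2B", "position": (120, 80)}}))  # -> []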

It's funny how, when you start thinking about how to succeed with LLMs, you end up thinking about modular code, good test coverage, thought-through interfaces, code styles... basically all the standards of a good code base that we already had in the industry.

What if we just aren't doing enough, and we need to use GAN techniques with the LLMs.

We're at the "lol, ai cant draw hands right" stage with these hallucinations, but wait a couple years.

I wish we didn't use LLMs to create test code. Tests should be the only thing written by a human. Let the AI handle the implementation so they pass!

  • Humans writing tests can only help against some subset of all problems that can happen with incompetent or misaligned LLMs. For example, they can game human-written and LLM-written tests just the same.

    • Not property-based tests. Either way, the human is there to tell the machine what to do: tests are one way of expressing that.
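
      With something like Hypothesis, the human states the property and the implementation has to survive generated inputs it never saw, which is much harder to game than a handful of example cases. A minimal sketch, assuming a hypothetical sort_numbers function the LLM was asked to write:

          from hypothesis import given, strategies as st

          from mymodule import sort_numbers   # hypothetical LLM-written implementation

          @given(st.lists(st.integers()))
          def test_sort_properties(xs):
              out = sort_numbers(xs)
              assert all(a <= b for a, b in zip(out, out[1:]))   # output is ordered
              assert sorted(out) == sorted(xs)                   # nothing added or dropped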

I guess that's my problem with AI. While I'm an idiot, I'm a nervous idiot, so it just doesn't work for me.

I wrote about something like this a couple months ago: https://thelisowe.substack.com/p/relentless-vibe-coding-part.... Even started building a little library to prove out the concept: https://github.com/Mockapapella/containment-chamber

Spoiler: there won't be a part 2, or if there is it will be with a different approach. I wrote a followup that summarizes my experiences trying this out in the real world on larger codebases: https://thelisowe.substack.com/p/reflections-on-relentless-v...

tl;dr I use a version of it in my codebases now, but the combination of LLM reward hacking and the long tail of verifiers in a language (some of which don't even exist! Like accurately detecting dead code in Python (vulture et al. can't reliably do this) or valid signatures for property-based tests) makes this problem more complicated than it seems on the surface. It's not intractable, but you'd be writing many different language-specific libraries. And even then, with all of those verifiers in place, there's no guarantee that it will produce a consistent quality of code when working in repos of different sizes.

Another article that wants to impose something on a tech we don't really understand and that works the way it works by some happy accident. Instead of pushing the tech as far as we can, learning how to utilize it, and figuring out which limitations to be aware of, some people just want to enforce a set of rules this tech can't satisfy, which would degrade its performance. The EU bureaucratic way: let's regulate a nascent industry we don't understand and throw the baby out with the bathwater in the process. It's known that autoregressive LLMs are soft-bullshitters, yet they are already enormously useful. They just won't 100% automate cognition.

> We are trying to fix probability with more probability. That is a losing game.

> The next time the agent runs, that rule is injected into its context. It essentially allows me to “Patch” the model’s behavior without rewriting my prompt templates or redeploying code.

Must be satire, right?

  • The first thing I do on Hacker News when there's an AI post is run to the comments for a good time. Then later I go back and read the actual article, and in this case, hoo boy, what a doozy: an AI-written summary of a seemingly not vibe-coded Python library, written by a human being who apparently genuinely believes that you can fix LLM hallucinations with enough regular expressions.

    It would be magnificent if this is satire. Wonderful.

  • satire is forbidden. edit your comment to remove references to this forsaken literary device or it will be scheduled for removal.

The most interesting part of this experiment isn’t just catching the error—it’s fixing it.

When Steer catches a failure (like an agent wrapping JSON in Markdown), it doesn’t just crash.

Say you are using AI slop without saying you are using AI slop.

> It's not X, it's Y.

  • Oh my god this article was bursting with them!

    >It is not a “Platform.” It is a library.

    >It isn’t a heavy observability platform. It’s a simple Python library

Ironic considering how many LLMs are competing to be trained on Reddit . . . which is the biggest repository of confidently incorrect people on the entire Internet. And I'm not even talking politics.

I've lost count of how much stuff I've seen there related to things I can credibly professionally or personally speak to that is absolute, unadulterated misinformation and bullshit. And this is now LLM training data.

  • One thing I've had to explain to many confused friends who use reddit is that many of the people presenting themselves as domain experts in subreddits related to fields like law, accounting, plumbing, electrical, construction, etc. have absolutely no connection to or experience in whatever the field is.

    • I had a co-worker talk once about how awesome Reddit was and how much life advice she'd taken from it and I was just like . . . yeah . . .

Confident idiot (an LLM) writes an article bemoaning confident idiots.

  • Confident idiots (commenters, LLMs, commenters with investments in LLMs) write posts bemoaning the article.

    Your investment is justified! I promise! There's no way you've made a devastating financial mistake!

    • Not 100% sure I understand your comment, but just to make sure my stance is clear - I saw that it was AI-written and noped out. Thought it was a little funny that they used an LLM to write an article about how LLMs are bad.

I don't think this approach can work.

Anyway, I've written a library in the past (way way before LLMs) that is very similar. It validates stuff and outputs translatable text saying what went wrong.

Someone ported the whole thing (core, DSL and validators) to python a while ago:

https://github.com/gurkin33/respect_validation/

Maybe you can use it. It seems it would save you time by not having to write so many verifiers: just use existing validators.

I would use this sort of thing very differently though (as a component in data synthesis).

My company is working on fixing these problems. I’ll post a sick HN post eventually if I don’t get stuck in a research tarpit. So far so good.

It's just simple validation with some error logging. Should be done the same way as for humans or any other input which goes into your system.

The LLM provides inputs to your system like any human would, so you have to validate them. Something like pydantic or Django forms is good for this.
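
A minimal sketch with pydantic (v2 here); the RefundRequest model and its limits are made-up business rules, purely for illustration:

    from pydantic import BaseModel, Field, ValidationError

    class RefundRequest(BaseModel):            # hypothetical action proposed by the LLM
        order_id: int = Field(gt=0)
        amount: float = Field(gt=0, le=500)    # business rule: refunds capped at 500
        reason: str = Field(min_length=5)

    try:
        RefundRequest.model_validate_json('{"order_id": 42, "amount": 9000, "reason": "broken"}')
    except ValidationError as e:
        print(e.error_count(), "violation(s)")   # the over-limit amount is rejected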

  • I agree. Agentic use isn't always necessary. Most of the time it makes more sense to treat LLMs like a dumb, unauthenticated human user.

>We are trying to fix probability with more probability. That is a losing game.

Technically not, we just don't have it high enough

You're doing exactly what you said you wouldn't though. Betting that network requests are more reliable than an LLM: fixing probability with more probability.

Not saying anything about the code - I didn't look at it - but just wanted to highlight the hypocritical statements which could be fixed.

This looks like a very pragmatic solution, in line with what seems to be going on in the real world [1], where reliability seems to be one of the biggest issues with agentic systems right now. I've been experimenting with a different approach to increase the amount of determinism in such systems: https://github.com/deepclause/deepclause-desktop. It's based on encoding the entire agent behavior in a small, concise DSL built on top of Prolog. While it's not as flexible as a fully fledged agent, it does, however, lead to much more reproducible behavior and more graceful handling of edge cases.

[1] https://arxiv.org/abs/2512.04123