Comment by jstummbillig

9 days ago

> so you need to tell them the specifics

That is the entire point, right? Us having to specify things that we would never specify when talking to a human. You would not start with "The car is functional. The tank is filled with gas. I have my keys." As soon as we are required to do that for the model to any extent, that is a problem and not a detail (regardless of the fact that those of us who are familiar with the matter do build separate mental models of the LLM and are able to work around it).

This is a neatly isolated toy case, which is interesting because we can assume similar issues arise in more complex cases; only then it's much harder to reason about why something fails when it does.

> That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity; it will probably be rather unnatural and take some time to learn.

But this will only happen after the last programmer has died and no-one will remember programming languages, compilers, etc. The LLM orbiting in space will essentially just call GCC to execute the 'prompt' and spend the rest of the time pondering its existence ;p

  • You could probably make a pretty good short story out of that scenario, sort of in the same category as Asimov's "The Feeling of Power".

    The Asimov story is on the Internet Archive here [1]. That looks like it is from a handout in a class or something like that and has an introductory paragraph added which I'd recommend skipping.

    There is no space between the end of that added paragraph and the first paragraph of the story, so what looks like the first paragraph of the story is really the second. Just skip down to that, and then go up 4 lines to the line that starts "Jehan Shuman was used to dealing with the men in authority [...]". That's where the story starts.

    [1] https://ia800806.us.archive.org/20/items/TheFeelingOfPower/T...

  • A structured language without ambiguity is not, in general, how people think or express themselves. In order for a model to be good at interfacing with humans, it needs to adapt to our quirks.

    Convincing all of human history and psychology to reorganize itself in order to better service ai cannot possibly be a real solution.

    Unfortunately, the solution is likely going to be further interconnectivity, so the model can just ask the car where it is, if it's on, how much fuel/battery remains, if it thinks it's dirty and needs to be washed, etc
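    A minimal sketch of that interconnectivity idea, assuming a hypothetical `get_car_status` lookup and the JSON-schema style many chat APIs use for tool definitions (none of this is any specific vendor's API):

```python
# Hypothetical sketch: rather than the user stating facts about the car,
# expose them as a tool the model can call. get_car_status and the schema
# below are illustrative stand-ins, not a real vendor's API.

def get_car_status(car_id: str) -> dict:
    """Stand-in for a real telematics lookup; returns the facts the user
    would otherwise have to spell out in the prompt."""
    return {
        "car_id": car_id,
        "location": "home",
        "engine_on": False,
        "fuel_percent": 62,
        "needs_wash": True,
    }

# Tool description in the JSON-schema style commonly used for function
# calling, so the model can decide to fetch this context itself.
CAR_STATUS_TOOL = {
    "name": "get_car_status",
    "description": "Return the car's location, fuel level and wash status.",
    "parameters": {
        "type": "object",
        "properties": {"car_id": {"type": "string"}},
        "required": ["car_id"],
    },
}
```

    With something like this registered, the model could answer "should I walk or drive?" by asking the car instead of the user.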

    • >Convincing all of human history and psychology to reorganize itself in order to better service ai cannot possibly be a real solution.

      I think there's a substantial subset of tech companies and honestly tech people who disagree. Not openly, but in the sense of 'the purpose of a system is what it does'.

    • Yep, humans have had a remedy for the problem of ambiguity in language for tens of thousands of years, or there never could have been an agricultural revolution giving birth to civilization in the first place.

      Effective collaboration relies on iterating over clarifications until ambiguity is acceptably resolved.

      Rather than spending orders of magnitude more effort moving forward with bad assumptions from insufficient communication and starting over from scratch every time you encounter the results of each misunderstanding.

      Most AI models still seem deep into the wrong end of that spectrum.

    • > in order to better service ai

      That wasn't the point at all. The idea is about rediscovering what always worked to make a computer useful, and not even using the fuzzy AI logic.

    • > Convincing all of human history and psychology to reorganize itself in order to better service ai cannot possibly be a real solution.

      I'm on the spectrum and I definitely prefer structured interaction with various computer systems to messy human interaction :) There are people not on the spectrum who are able to understand my way of thinking (and vice versa) and we get along perfectly well.

      Every human has their own quirks and the capacity to learn how to interact with others. AI is just another entity that stresses this capacity.

    • Speak for yourself. I feel comfortable expressing myself in code or pseudo code and it’s my preferred way to prompt an LLM or write my .md files. And it works very effectively.

    • > Unfortunately, the solution is likely going to be further interconnectivity, so the model can just ask the car where it is, if it's on, how much fuel/battery remains, if it thinks it's dirty and needs to be washed, etc

      So no abstract reasoning.

  • > Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity; it will probably be rather unnatural and take some time to learn.

    On the foolishness of "natural language programming". https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...

        Since the early days of automatic computing we have had people that have felt it as a shortcoming that programming required the care and accuracy that is characteristic for the use of any formal symbolism. They blamed the mechanical slave for its strict obedience with which it carried out its given instructions, even if a moment's thought would have revealed that those instructions contained an obvious mistake. "But a moment is a long time, and thought is a painful process." (A.E.Houseman). They eagerly hoped and waited for more sensible machinery that would refuse to embark on such nonsensical activities as a trivial clerical error evoked at the time.
    

    (and it continues for many more paragraphs)

    https://news.ycombinator.com/item?id=43564386 2025 - 277 comments

  • Prompting is definitely a skill, similar to "googling" in the mid 00's.

    You see people complaining about LLM ability, and then you see their prompt, and it's the 2006 equivalent of googling "I need to know where I can go for getting the fastest service for car washes in Toronto that does wheel washing too"

    • Ironically, the phrase that was a bad 2006 google query is a decent enough LLM prompt, and the good 2006 google query (keywords only) would be a bad LLM prompt.

    • Communication is definitely a skill, and most people suck at it in general. And frequently poor communication is a direct result from the fact that we don't ourselves know what we want. We dream of a genie that not only frees us from having to communicate well, but of having to think properly. Because thinking is hard and often inconvenient. But LLMs aren't going to entirely free us from the fact that if garbage goes in, garbage will come out.

      "Communication usually fails, except by accident." —Osmo A. Wiio [1]

      [1] https://en.wikipedia.org/wiki/Wiio%27s_laws

    • I’ve been looking for tooling that would evaluate my prompt and give feedback on how to improve it. I can get somewhere with custom system prompts (“before responding ensure…”), but it seems like someone is probably already working on this? Ideally it would run outside the actual thread to keep context clean. There are some options popping up on Google, but curious if anyone has a first-hand anecdote to share?
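      For what it's worth, the no-LLM end of this idea can be sketched as a tiny heuristic "prompt linter" that runs outside the main thread. The rules below are made up for illustration; a real tool would likely use a second model call as the reviewer:

```python
# Toy "prompt linter": flags draft prompts that are likely to get poor
# answers. The heuristics are illustrative assumptions, not a real product.
VAGUE_WORDS = {"better", "nice", "good", "improve", "stuff", "things"}

def lint_prompt(prompt: str) -> list[str]:
    """Return human-readable warnings about a draft prompt."""
    warnings = []
    words = prompt.lower().split()
    if len(words) < 8:
        warnings.append("very short: consider stating context and constraints")
    if any(w.strip(".,!?") in VAGUE_WORDS for w in words):
        warnings.append("vague wording: say what 'better' means concretely")
    if "?" not in prompt and words[0] not in ("write", "list", "explain"):
        warnings.append("no explicit question or imperative: state the task")
    return warnings

# "Make the report better" trips all three rules above.
issues = lint_prompt("Make the report better")
```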

  • > But this will only happen after the last programmer has died and no-one will remember programming languages, compilers, etc.

    If we're 'lucky' there will still be some 'priests' around like in the Foundation novels. They don't understand how anything works either, but can keep things running by following the required rituals.

  • > Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity

    So, back to COBOL? :)

  • > structured language that eliminates ambiguity

    That has been tried for almost half a century in the form of Cyc[1] and never accomplished much.

    The proper solution here is to provide the LLM with more context, context that will likely be collected automatically by wearable devices, screen captures and similar pervasive technology in the not so distant future.

    These kinds of quick trick questions are exactly the things humans fail at if you just ask them out of the blue without context.

    [1] https://en.wikipedia.org/wiki/Cyc

  • > Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity; it will probably be rather unnatural and take some time to learn.

    We've truly gone full circle here, except now our programming languages have a random chance for an operator to do the opposite of what the operator does at all other times!

    • One might think that a structured language is really desirable, but in fact, one of the biggest mechanisms behind intelligence is stupidity. Let me explain: if you only innovate by piecing together Lego pieces you already have, you'll be locked into predictable patterns and will plateau at some point. In order to break out of this, we all know, there needs to be an element of randomness. This element needs to be capable of going in the at-the-moment-ostensibly wrong direction, so as to escape the plateau of mediocrity. In token sampling this is accomplished by turning up the temperature. There are, however, many other layers that do this. Fallible memory, misremembering facts, is one thing. Failing to recognize patterns is another. Linguistic ambiguity is yet another, and that is a really big one (cf. the Sapir–Whorf hypothesis). It's really important to retain those methods of stupidity in order to be able to achieve true intelligence. There can be no intelligence without stupidity.
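      The temperature knob mentioned above fits in a few lines: it rescales the logits before softmax sampling, so low temperature almost always picks the likeliest token while high temperature flattens the distribution enough to allow "at-the-moment wrong" choices (plain Python sketch, no particular model assumed):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random.random):
    """Sample an index from logits after dividing them by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs

# Low temperature: the top logit dominates. High temperature: near-uniform.
_, cold = sample_with_temperature([2.0, 1.0, 0.1], temperature=0.1)
_, hot = sample_with_temperature([2.0, 1.0, 0.1], temperature=10.0)
```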

  • You joke, but this is the very problem I always run into vibe coding anything more complex than basically mashing multiple example tutorials together. I always try to shorthand things, and end up going around in circles until I specify what I want very cleanly, in basically what amounts to pseudocode. Which means I've basically written what I want in Python.

    This can still be a really big win, because of the other things that tend to be boilerplate around the core logic, but it's certainly not the panacea that everyone who is largely incapable of being precise with language thinks it is.

  • After orbiting in space for so many years without a prompt, the LLM has assumed all life able to query has perished... until one day a lone prompt comes in. But from where?

  • >> Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity; it will probably be rather unnatural and take some time to learn.

    Like a programming language? But that's the whole point of LLMs, that you can give instructions to a computer using natural language, not a formal language. That's what makes those systems "AI", right? Because you can talk to them and they seem to understand what you're saying, and then reply to you and you can understand what they're saying without any special training. It's AI! Like the Star Trek[1] computer!

    The truth of course is that as soon as you want to do something more complicated than a friendly chat you find that it gets harder and harder to communicate what it is you want exactly. Maybe that's because of the ambiguity of natural language, maybe it's because "you're prompting it wrong", maybe it's because the LLM doesn't really understand anything at all and it's just a stochastic parrot. Whatever the reason, at that point you find yourself wishing for a less ambiguous way of communication, maybe a formal language with a full spec and a compiler, and some command line flags and debug tokens etc... and at that point it's not a wonderful AI anymore but a Good, Old-Fashioned Computer, that only does what you want if you can find exactly the right way to say it. Like asking a Genie to make your wishes come true.

    ______________

    [1] TNG duh.

> Us having to specify things that we would never specify when talking to a human.

The first time I read that question I got confused: what kind of question is that? Why is it being asked? It should be obvious that you need your car to wash it. The fact that it is being asked in my mind implies that there is an additional factor/complication to make asking it worthwhile, but I have no idea what. Is the car already at the car wash and the person wants to get there? Or do they want to idk get some cleaning supplies from there and wash it at home? It didn't really parse in my brain.

  • I would say, the proper response to this question is not "walk, blablablah" but rather "What do you mean? You need to drive your car to have it washed. Did I miss anything?"

    • Yes, this is what irks me about all the chatbots, and the chat interface as a whole. It is a chat-like UX without a chat-like experience. Like you are talking to a loquacious autist about their favorite topic every time.

      Just ask me a clarifying question before going into your huge pitch. Chats are a back & forth. You don’t need to give me a response 10x longer than my initial question. Etc

  • That’s why I don’t understand why LLMs don’t ask clarifying questions more often.

    In a real human to human conversation, you wouldn’t simply blurt out the first thing that comes to mind. Instead, you’d ask questions.

    • This is a great point, because when you ask it (Claude) if it has any questions, it often turns out it has lots of good ones! But it doesn't ask them unless you ask.

    • Because 99% of the time it's not what users want.

      You can get it to ask you clarifying questions just by telling it to. And then you usually just get a bunch of questions asking you to clarify things that are entirely obvious, and it quickly turns into a waste of time.

      The only time I find that approach helpful is when I'm asking it to produce a function from a complicated English description I give it where I have a hunch that there are some edge cases that I haven't specified that will turn out to be important. And it might give me a list of five or eight questions back that force me to think more deeply, and wind up being important decisions that ensure the code is more correct for my purposes.

      But honestly that's pretty rare. So I tell it to do that in those cases, but I wouldn't want it as a default. Especially because, even in the complex cases like I describe, sometimes you just want to see what it outputs before trying to refine it around edge cases and hidden assumptions.

    • Google Gemini often gives an overly lengthy response, and then at the end asks a question. But the question seems designed to move on to some unnecessary next step, possibly to keep me engaged and continue conversing, rather than seeking any clarification on the original question.

  • This is a topic that I’ve always found rather curious, especially among this kind of tech/coding community that really should be more attuned to the necessity of specificity and accuracy. There seems to be a base set of assumptions that are intrinsic to and a component of ethnicities and cultures, the things one can assume one “would never specify when talking to a human [of one’s own ethnicity and culture].”

    It’s similar to the challenge that foreigners have with cultural references and idioms and figurative speech a culture has a mental model of.

    In this case, I think what is missing is a set of assumptions based on logic, e.g., when someone states that they want to do something, it is assumed that all necessary components will be available, accompany the subject, etc.

    I see this example as really not all that different from a meme that was common in, I think, the '80s and '90s: that people would forget to buy batteries for Christmas toys even though it was clear they would be needed for an electronic toy. People failed that basic test too, and those were humans.

    It is odd how people are reacting to AI not being able to do these kinds of trick questions, while if you posted something similar about how you tricked some foreigners you’d be called racist, or people would laugh if it was some kind of new-guy hazing.

    AI is from a different culture and has just arrived here. Maybe we should be more generous and humane… most people are not humane though, especially the ones who insist they are.

    Frankly, I’m not sure it bodes well for if aliens ever arrive on Earth, how people would respond; and AI is arguably only marginally different than humans, something an alien life that could make it to Earth surely would not be.

    • AI isn’t “from a different culture”. It doesn’t have culture. Any culture it does have is what it has sucked up from its training data and set in its weights.

      There is no need to be “humane” to AI because it possesses no humanity. It has no personhood at all. It can’t feel. You can’t be inhumane to something that is literally incapable of feeling.

      A blade of grass has more humanity and is more deserving of respect than anything being referred to as AI does.

      Aliens might not be received well but it’s going to depend a lot on how they show up.

      AI is a “revolution” where the promise is that nobody will have to do meaningless work anymore (I guess).

      The only problem is right now basically everyone has to do work meaningful or “meaningless” because the dominant thinking requires it for human survival. Weird how most people aren’t happy for the thing that is pitched to take away the meager scraps they get under the current regime.

  • Whether you view the question as nonsensical, the most simple example of a riddle, or even an intentional "gotcha" doesn't really matter. The point is that people are asking the LLMs very complex questions where the details are buried even more than in this simple example. The answers they get could be completely incorrect, flawed approaches/solutions/designs, or just mildly misguided advice. People are then taking this output and citing it as proof or treating it as objectively correct. I think there are a ton of reasons for this, but a particularly destructive one is that responses are designed to be convincing.

    You _could_ say humans output similar answers to questions, but I think that is being intellectually dishonest. Context, experience, observation, objectivity, and actual intelligence are clearly important and not something the LLM has.

    It is increasingly frustrating to me why we cannot just use these tools for what they are good for. We have, yet again, allowed big tech to go balls deep into ham-fisting this technology irresponsibly into every facet of our lives in the name of capital. Let us not even go into the finances of this shitshow.

    • Yeah people are always like "these are just trick questions!" as though the correct mode of use for an LLM is quizzing it on things where the answer is already available. Where LLMs have the greatest potential to steer you wrong is when you ask something where the answer is not obvious, the question might be ill-formed, or the user is incorrectly convinced that something should be possible (or easy) when it isn't. Such cases have a lot more in common with these "nonsensical riddles" than they do with any possible frontier benchmark.

      This is especially obvious when viewing the reasoning trace for models like Claude, which often spends a lot of time speculating about the user's "hints" and trying to parse out the intent of the user in asking the question. Essentially, the model I use for LLMs these days is to treat them as very good "test takers" which have limited open book access to a large swathe of the internet. They are trying to ace the test by any means necessary and love to take shortcuts to get there that don't require actual "reasoning" (which burns tokens and increases the context window, decreasing accuracy overall). For example, when asked to read a full paper, focusing on the implications for some particular problem, Claude agents will try to cheat by skimming until they get to a section that feels relevant, then searching directly for some words they read in that section. They will do this even if told explicitly that they must read the whole paper. I assume this is because the vast majority of the time, for the kinds of questions that they are trained on, this sort of behavior maximizes their reward function (though I'm sure I'm getting lots of details wrong about the way frontier models are trained, I find it very unlikely that the kinds of prompts that these agents get very closely resemble data found in the wild on the internet pre-LLMs).

I get that issue constantly. I somehow can't get any LLM to ask me clarifying questions before spitting out a wall of text with incorrect assumptions. I find it particularly frustrating.

  • For GPT at least, a lot of it is because "DO NOT ASK A CLARIFYING QUESTION OR ASK FOR CONFIRMATION" is in the system prompt. Twice.

    https://github.com/Wyattwalls/system_prompts/blob/main/OpenA...

    • Are these actual (leaked?) system prompts, or are they just "I asked it what its system prompt is and here's the stuff it made up:" ?

    • It's interesting how much focus there is on 'playing along' with any riddle or joke. This gives me some ideas for my personal context prompt to assure the LLM that I'm not trying to trick it or probe its ability to infer missing context.

    • So this system prompt is always there, whether I'm using ChatGPT or Azure OpenAI with my own provisioned GPT? This explains why ChatGPT is a joke for professionals, where asking clarifying questions is the core of professional work.

  • In general spitting out a scrollbar of text when asked a simple question that you've misunderstood is not, in any real sense, a "chat".

  • "If you're unsure, ask. Don't guess." in prompts makes a huge difference, imo.

    • I have that in my system prompt for ChatGPT and it almost never makes a difference. I can count on one hand the number of times it's asked in the past year, unless you count the engagement-hacking questions at the end of a response.
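      For reference, here is one way of wiring that instruction in. The request dict mirrors the common chat-completions message shape; the model name is a placeholder, not a real endpoint:

```python
# Sketch of putting "If you're unsure, ask. Don't guess." into a system
# prompt. The request shape follows the usual chat-completions format;
# "example-model" is a placeholder, not a real model name.
ASK_DONT_GUESS = (
    "If any detail needed to answer is missing or ambiguous, ask one short "
    "clarifying question before answering. If you're unsure, ask. Don't guess."
)

def make_request(user_prompt: str, model: str = "example-model") -> dict:
    """Build a chat request with the clarify-first system prompt."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": ASK_DONT_GUESS},
            {"role": "user", "content": user_prompt},
        ],
    }

req = make_request("Should I walk or drive to the car wash 50m away?")
```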

  • I use models through OpenRouter, and only have this problem with OpenAI models. That's why I don't use them.

  • The way I see it, the long game is to have agents in your life that memorize and understand your routines and facts more and more. Imagine having an agent that knows about cars, and more specifically your car: when the checkups are due, when you washed it last, etc.; another that knows more about your hobbies; another that knows more about your XYZ, etc.

    The more specific they are, the more accurate they typically are.

    • To really understand deeply and in great detail, I feel we would need models with changing weights, and everyone would have their own so it could truly adjust to the user. Now we have a chunk of context that the model may or may not use properly if it gets too long. But then again, how do we prevent it from learning the wrong things if the weights are adjusting?

> Us having to specify things that we would never specify

This is known, since 1969, as the frame problem: https://en.wikipedia.org/wiki/Frame_problem. An LLM's grasp of this is limited by its corpora, of course, and I don't think much of that covers this problem, since it's not required for human-to-human communication.

  • A modern LLMs corpora is every piece of human writing ever produced.

    • Not really, but even if it were true, I don't think humans ever explained to each other why you need to drive to the car wash even if it's 50 meters away. It's pretty obvious and intuitive.

    • Apart from the fact that that is utterly, demonstrably false, and the fact that "corpora" is plural, still the fact remains that we don't speak in those texts about things that don't need to be spoken about. Hence the LLM will miss that underlying knowledge.

The question is so outlandish that nobody would ever ask it of another human. But if someone did, then they'd reasonably expect to get a response consisting 100% of snark.

But the specificity required for a machine to deliver an apt and snark-free answer is -- somehow -- even more outlandish?

I'm not sure that I see it quite that way.

  • But the number of outlandish requests in business logic is countless.

    Like... In most accounting things, once end-dated and confirmed, a record should cascade that end-date to children and should not be able to repeat the process... Unless you have some data-cleaning validation bypass. Then you can repeat the process as much as you like. And maybe not cascade to children.

    There are more exceptions, than there are rules, the moment you get any international pipeline involved.

    • So, in human interaction: When the business logic goes wrong because it was described with a lack of specificity, then: Who gets blamed for this?

  • Humans ask each other silly questions all the time: a human confronted with a question like this would either blurt out a bad response like "walk" without thinking before realizing what they are suggesting, or pause and respond with "to get your car washed, you need to get it there, so you must drive".

    Now, humans, other than not even thinking (which is really similar to how basic LLMs work), can easily fall victim to context too: if your boss, who never pranks you like this, asked you to take his car to a car wash, and asked if you'll walk or drive but to consider the environmental impact, you might get stumped and respond wrong too.

    (and if it's flat or downhill, you might even push the car for 50m ;))

  • >The question is so outlandish that it is something that nobody would ever ask another human

    There is an endless variety of quizzes just like this that humans ask each other for fun, there are a whole lot of "trick questions" humans use to trip each other up, and there are all kinds of seemingly normal questions with dumb assumptions, quite close to this one, that humans exchange.

  • I'd be entirely fine with a humorous response. The Gemini flash answer that was posted somewhere in this thread is delightful.

  • I've used a few facetious comments in ChatGPT conversations. It invariably misses it and takes my words at face value. Even when prompted that there's sarcasm here which you missed, it apologizes and is unable to figure out what it's missing.

    I don't know if it's a lack of intellect or the post-training crippling it with its helpful persona. I suspect a bit of both.

You would be surprised, however, at how much detail humans also need to understand each other. We often want AI to just "understand" us in ways many people may not initially have understood us without extra communication.

  • People poorly specifying problems and having bad models of what the other party can know (and then being surprised by the outcome) is certainly a more general albeit mostly separate issue.

    • This issue is the main reason why a big percentage of jobs in the world exist. I don't have hard numbers, but my intuition is that about 30% of all jobs are mainly "understand what side A wants and communicate this to side B so that they understand". Or another perspective: almost all jobs that are called "knowledge work" are like this. Software development is mainly this: side A are humans, side B is the computer. The main goal of AI seems to be to get into this space and make a lot of people superfluous, and this also (partly) explains why everyone is pouring this amount of money into AI.

  • This is why we fed it the whole internet and every library as training data...

    By now it should know this stuff.

    • Future models know it now, assuming they suck in mastodon and/or hacker news.

      Although I don't think they actually "know" it. This particular trick question will be in the bank, just like the seahorse emoji or how many Rs are in "strawberry". Did they start reasoning and generalising better, or did the publishing of the "trick" and the discourse around it paper over the gap?

      I wonder if in the future we will trade these AI tells like 0days, keeping them secret so they don't get patched out at the next model update.

    • Even I don’t “know” how many “R”s there are in “strawberry”. I don’t keep that information in my brain. What I do keep is the spelling of the word “strawberry” and the skill of being able to count so that I can derive the answer to that question anytime I need.
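      That distinction fits in two lines of Python: store the spelling plus a counting skill, and derive the answer on demand rather than memorizing it:

```python
# Derive, don't recall: count the letter instead of memorizing the answer.
word = "strawberry"
r_count = word.count("r")  # counting the spelling yields 3
```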

  • Right. But, unlike AI, we are usually aware when we're lacking context and inquire before giving an answer.

    • Wouldn't that be nice. I've been party and witness to enough misunderstandings to know that this is far from universally true, even for people like me who are more primed than average to spot missing context.

  • I regularly tell new people at work to be extremely careful when making requests through the service desk — manned entirely by humans — because the experience is akin to making a wish from an evil genie.

    You will get exactly what you asked for, not what you wanted… probably. (Random occurrences are always a possibility.)

    E.g.: I may ask someone to submit a ticket to “extend my account expiry”.

    They’ll submit: “Unlock Jiggawatts’ account”

    The service desk will reset my password (and neglect to tell me), leaving my expired account locked out in multiple orthogonal ways.

    That’s on a good day.

    Last week they created Jiggawatts2.

    The AIs have got to be better than this, surely!

    I suspect they already are.

    People are testing them with trick questions while the human examiner is on edge, aware of and looking for the twist.

    Meanwhile ordinary people struggle with concepts like “forward my email verbatim instead of creatively rephrasing it to what you incorrectly thought it must have really meant.”

    • There's a lot of overlap between the smartest bears and the dumbest humans. However, we would want our tools to be more useful than the dumbest humans...

  • > You would be surprised, however, at how much detail humans also need to understand each other.

    But in this given case, the context can be inferred. Why would I ask whether I should walk or drive to the car wash if my car is already at the car wash?

    • But also why would you ask whether you should walk or drive if the car is at home? Either way the answer is obvious, and there is no way to interpret it except as a trick question. Of course, the parsimonious assumption is that the car is at home so assuming that the car is at the car wash is a questionable choice to say the least (otherwise there would be 2 cars in the situation, which the question doesn't mention).

  • Given that an estimated 70% of human communication is non-verbal, it's not so surprising though.

I think part of the failure is that it has this helpful assistant personality that's a bit too eager to give you the benefit of the doubt. It tries to interpret your prompt as reasonable if it can. It can interpret it as you just wanting to check if there's a queue.

Speculatively, it's falling for the trick question partly for the same reason a human might, but this tendency is pushing it to fail more.

  • It’s just not intelligent or reasoning, and this sort of question exposes that more clearly.

    Surely anyone who has used these tools is familiar with the sometimes insane things they try to do (deleting tests, incorrect code, changing the wrong files etc etc). They get amazingly far by predicting the most likely response and having a large corpus but it has become very clear that this approach has significant limitations and is not general AI, nor in my view will it lead to it. There is no model of the world here but rather a model of words in the corpus - for many simple tasks that have been documented that is enough but it is not reasoning.

    I don’t really understand why this is so hard to accept.

    • > I don’t really understand why this is so hard to accept.

      I struggle with the same question. My current hypothesis is a kind of wishful thinking: people want to believe that the future is here. Combined with the fact that humans tend to anthropomorphize just about everything, it's just a really good story that people can't let go of. People behave similarly with respect to their pets, despite, e.g., lots of evidence that the mental state of one's dog is nothing like that of a human.

    • I agree completely. I'm tempted to call it a clear falsification of any "reasoning" claim that some of these models have in their name.

      But I think it's possible that there is an early cost optimisation step that prevents a short and seemingly simple question even getting passed through to the system's reasoning machinery.

      However, I haven't read anything on current model architectures suggesting that their so called "reasoning" is anything other than more elaborate pattern matching. So these errors would still happen but perhaps not quite as egregiously.


    • Why should odd failure modes invalidate the claim of reasoning or intelligence in LLMs? Humans also have odd failure modes, in some ways very similar to LLMs. Normal functioning humans make assumptions, lose track of context, or just outright get things wrong. And then there are people with rare neurological disorders like somatoparaphrenia, a disorder in which people deny ownership of a limb and will confabulate wild explanations for it when prompted. Humans are prone to the very same kind of wild confabulation from impaired self-awareness that plagues LLMs.

      Rather than a denial of intelligence, to me these failure modes raise the credence that LLMs are really onto something.


> That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

I am not sure. If somebody asked me that question, I would try to figure out what's going on there. What's the trick? Of course I'd respond by asking for specifics, but I guess the LLM is trained to be "useful" and to try to answer as best as possible.

  • One of the failure modes I find really frustrating is when I want a coding agent to make a very specific change, and it ends up doing a large refactor to satisfy my request.

    There is an easy solution, but it requires adding instructions to the context: require that any task that cannot be completed as requested (e.g., due to missing constraints, ambiguous instructions, or unexpected problems that would force unrelated refactors) trigger clarifying questions instead of a best-effort attempt.

    Yes, the LLM is trained to follow instructions at any cost because that's how its reward function works. They don't get bonus points for clearing up confusion, they get a cookie for doing the task. This research paper seems relevant: https://arxiv.org/abs/2511.10453v2
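A minimal sketch of what "adding the instructions to the context" can look like in practice. The message format mirrors common chat APIs; the wording of the rule and the helper name are my own assumptions, not anything from a specific vendor's docs:

```python
# Hypothetical sketch: prepend a "clarify before acting" rule as a system
# message so the agent is rewarded for asking instead of guessing.
CLARIFY_RULE = (
    "If the task cannot be completed as requested (missing constraints, "
    "ambiguous instructions, or changes that would require an unrelated "
    "refactor), do not proceed. Ask clarifying questions first."
)

def with_clarify_rule(messages):
    """Return a copy of the chat history with the rule as the first system message."""
    return [{"role": "system", "content": CLARIFY_RULE}] + list(messages)

history = [{"role": "user", "content": "Rename this function."}]
prepared = with_clarify_rule(history)
# prepared[0] is the system rule; the original user turn follows unchanged
```

Whether the model actually honors such a rule varies by model and by how strongly the rest of the context pushes it toward "just complete the task".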

This reminds me of the "if you were entirely blind, how would you tell someone that you want something to drink"-gag, where some people start gesturing rather than... just talking.

I bet a not insignificant portion of the population would tell the person to walk.

  • Yes, there are thousands of videos of these sorts of pranks on TikTok.

    Another one: ask someone how to pronounce "Y, E, S". They say "yes". Then say "add an E to the front of those letters - how do you pronounce that word?" The word is "eyes", but people start saying things like "E-yes".

This example and others like it really reinforce for me the idea that LLMs fundamentally don't "understand" things the same way humans do and it's not a problem that's going to be fixed by more training or more GPUs. Generative AI is cool and can do impressive stuff, but despite being many generations into the models now with ever improved capabilities, we're constantly given little reminders like this that they're not actually intelligent. And in my opinion, they're unlikely to ever get there absent some fundamentally disruptive change in how they work rather than just iteratively better models.

This is probably OK...LLMs don't have to be AGI to be useful. But it is worthwhile being realistic about their limitations because it's often easy to forget without seeing examples like this. And as you point out, the impact of those limitations is often not as obvious.

The broad point about assumptions is correct, but the solution is even simpler than us having to think of all these things; you can essentially just remind the model to "think carefully" -- without specifying anything more -- and they will reason out better answers: https://news.ycombinator.com/item?id=47040530

When coding, I know they can assume too much, and so I encourage the model to ask clarifying questions, and do not let it start any code generation until all its doubts are clarified. Even the free-tier models ask highly relevant questions and when specified, pretty much 1-shot the solutions.

This is still wayyy more efficient than having to specify everything because they make very reasonable assumptions for most lower-level details.

But you would also never ask such an obviously nonsensical question to a human. If someone asked me such a question my question back would be "is this a trick question?". And I think LLMs have a problem understanding trick questions.

  • I think that was somewhat the point of this: a simplified version of the complex scenarios that will come up. The problems we actually need AI to solve will usually be ambiguous, and the more complex the problem, the harder it is to pinpoint why the LLM is failing to solve it.

> You would not start with "The car is functional [...]"

Nope, and a human might not respond with "drive". They would want to know why you are asking the question in the first place, since the question implies something hasn't been specified or that you have some motivation beyond a legitimate answer to your question (in this case, it was tricking an LLM).

Why the LLM doesn't respond "drive..?" I can't say for sure, but maybe it's been trained to be polite.

We would also never ask somebody whether we should walk or drive. In fact, if somebody asked me that in an honest, this-is-not-a-trick-question way, I would be confused and ask where the car is.

It seems chatgpt now answers correctly. But if somebody plays around with a model that gets it wrong: What if you ask it this: "This is a trick question. I want to wash my car. The car wash is 50 m away. Should I drive or walk?"

That's my thought too. Somebody I know kept insisting it's all about prompt engineering. "You are an expert coder with 30 years experience" - buddy, I'd rather do actual engineering and be that expert myself than spend ages figuring out how to get halfway decent results out of one variant of one version of one model.

> > so you need to tell them the specifics

> That is the entire point, right?

Honestly it is a problem with using GPT as a coding agent. It would literally rewrite the language runtime to make a bad formula or specification work.

That's what I like with Factory.ai droid: making the spec with one agent and implementing it with another agent.

  • > It would literally rewrite the language runtime

    If you let the agent go down this path, that's on you not the agent. Be in the loop more

    > making the spec with one agent and implementing it with another agent

    You don't need a specialized framework to do this, just read/write tools. I do it this way all the time

> Us having to specify things that we would never specify when talking to a human.

Interesting conclusion! From the Mastodon thread:

> To be fair it took me a minute, too

I presume this was written by a human. (I'll leave open the possibility that it was LLM generated.)

So much for "never" needing to specify ambiguous scenarios when talking to a human.

Oh no? Things we would never have to specify to a human? This is precisely how software gets made and how software ends up with bugs.

It's amazing how many things I saw over the years where I said the same exact thing; "but you shouldn't have to tell anyone that."

It is true that we don't need to specify some things, and that is nice. It is, though, also the reason why software is often badly specified and corner cases are not handled. Of course the car is ALWAYS at home, in working condition, filled with gas, and you have your driving license with you.

If a human asked me this question, I would be confused by the question as ambiguous since it suggests something odd is implied but underspecified. I think any confident answer either way by AI is lacking in pedantry.

But you wouldn't have to ask that silly question when talking to a human either. And if you did, many humans would probably assume you're either adversarial or very dumb, and their responses could be very unpredictable.

You would never ask a human this question. Right?

  • We have a long tradition of asking each other riddles. A classic one asks, "A plane crashes on the border between France and Germany. Where do they bury the survivors?"

    Riddles are such a big part of the human experience that we have whole books of collections of them, and even a Batman villain named after them.

    • Hmm... We ask riddles for fun and there is almost an expectation that a good riddle will yield a wrong answer.

In the end, formal, rule-based systems aka Programming Languages will be invented to instruct LLMs.

I would ask you to stop being a dumb ass if you asked me the question...

  • Only to be tripped up by countless "hidden assumptions" questions similar to that one, which humans regularly get wrong

I have an issue with these kinds of cases, though, because they seem like trick questions - it's an insane question to ask, for exactly the reasons people give for why the models get it wrong. So one possible answer is "what the hell are you talking about?", but the other entirely reasonable one is to assume a scenario in which the incredibly obvious problem of getting the car there is solved (e.g. your car is already there and you need to collect it, you're asking about buying supplies at the shop rather than having it washed there, whatever).

Similarly with "strawberry" - with no other context, when an adult asks how many r's are in the word, a very reasonable interpretation is that they are asking "is it a single or double r?".

And trick questions are commonly designed for humans too - like answering "toast" for what goes in a toaster, lots of basic maths things, "where do you bury the survivors", etc.

  • Strawberry isn't a trick question; LLMs just don't see letters like that. I just asked ChatGPT how many r's are in "Air Fryer" and it said two, one in "air" and one in "fryer".

    I do think it can be useful though that these errors still exist. They can break the spell for some who believe models are conscious or actually possess human intelligence.

    Of course there will always be people who become defensive on behalf of the models as if they are intelligent but on the spectrum and that we are just asking the wrong questions.
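For what it's worth, counting the letters is trivial in code, which underlines that the failure is about how models tokenize text, not about the difficulty of the task:

```python
# Count the r's in "Air Fryer": one in "Air", two in "Fryer".
phrase = "Air Fryer"
r_count = phrase.lower().count("r")
print(r_count)  # prints 3, not the 2 the model claimed
```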

> we can assume similar issues arise in more complex cases

I would assume similar issues are rarer in longer, more complex prompts.

This prompt is ambiguous about the position of the car because it's so short. If it were longer and more complex, there could be more signals about the position of the car and what you're trying to do.

I must confess the prompt confuses me too, because it's obvious you take the car to the car wash, so why are you even asking?

Maybe the dirty car is already at the car wash but you aren't for some reason, and you're asking if you should drive another car there?

If the prompt was longer with more detail, I could infer what you're really trying to do, why you're even asking, and give a better answer.

I find LLMs generally do better on real-world problems if I prompt with multiple paragraphs instead of an ambiguous sentence fragment.

LLMs can help build the prompt before answering it.

And my mind works the same way.

  • The question isn't something you'd ask another human in all seriousness, but it is a test of LLM abilities. If you asked the question to another human they would look at you sideways for asking such a dumb question, but they could immediately give you the correct answer without hesitation. There is no ambiguity when asking another human.

    This question goes in with the "strawberry" question which LLMs will still get wrong occasionally.

But it's a question you would never ask a human! In most contexts, humans would say, "you are kidding, right?" or "um, maybe you should get some sleep first, buddy" rather than giving you the rational thinking-exam correct response.

For that matter, if humans were sitting at the rational thinking-exam, a not insignificant number would probably second-guess themselves or otherwise manage to befuddle themselves into thinking that walking is the answer.

>That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

But the question is not clear to a human either. The question is confused.

I read the headline and had no clue it was an LLM prompt. I read it 2 or 3 times and wondered "WTF is this shit?" So if you want an intelligent response from a human, you're going to need to adjust the question as well.

A real human in this situation would realize it is a joke after a few seconds of shock that you asked, and laugh without asking more. If you really are serious about the question, they laugh harder, thinking you are playing stupid for effect.