Comment by pedrosorio
1 year ago
> It knows english at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output
This is not an apt description of the system that insists the doctor is the mother of the boy involved in a car accident when elementary understanding of English and very little logic show that answer to be obviously wrong.
Many of my PhD and postdoc colleagues who emigrated from Korea, China, and India, and who didn't have English as their medium of instruction, would struggle with this question. They only recover when you give them a hint. They're some of the smartest people in general. If you stop trying to stump these models with trick questions and ask them straightforward reasoning questions, they are extremely performant (o1 is definitely a step up, though not revolutionary in my testing).
I live in one of the countries you mentioned and just showed this to a local friend who struggles with English. They had no problem concluding that the doctor was the child's dad. Full disclosure: they assumed the doctor was pretending to be the child's dad, which is also a perfectly sound answer.
The claim was that "it knows english at or above a level equal to most fluent speakers". If the claim is that it's very good at producing reasonable responses to English text, posing "trick questions" like this would seem to be a fair test.
Does fluency in English make someone good at solving trick questions? I usually don’t even bother trying but mostly because trick questions don’t fit my definition of entertaining.
Its knowledge is broad and general; it does not have insight into the specifics of a person's discussion style. Many humans struggle to distinguish sarcasm, for instance. It's hard to fault it for not being in alignment with the speaker and their strangely phrased riddle.
It answers better when told "solve the below riddle".
lol, I am neither a PhD nor a postdoc, but I am from India. I could understand the problem.
Did you have English as your medium of instruction? If yes, do you see the irony that you also couldn't read two sentences and get the facts straight?
I think you have particularly dumb colleagues then. If you post this question to an average STEM PhD in China (not even from China. In China) they'll get it right.
This question is the "unmisleading" version of a very common misleading question about sexism. ChatGPT has learned the original, misleading version so well that it can't answer the unmisleading version.
Humans who don't have the original version ingrained in their brains will answer it with ease. It's not even a tricky question to humans.
> it can't answer the unmisleading version.
Yes it can: https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...
“Don’t be mean to LLMs, it isn’t their fault that they’re not actually intelligent”
In general LLMs seem to function more reliably when you use pleasant language and good manners with them. I assume this is because the same bias also shows up in the training data.
"Don't anthropomorphize LLMs. They're hallucinating when they say they love that."
This illustrates a different point. This is a variation on a well-known riddle that definitely comes up in the training corpus many times. In the original riddle a father and his son are in a car accident and the father dies; the idea is that people will be confused about how the boy can be the doctor's son if his father just died, not realizing that women can be doctors too, so the doctor is the boy's mother. The original riddle is aimed at highlighting people's gender-stereotype assumptions.
Now, since the model was trained on this, it immediately recognizes the riddle and answers according to the much more common variant.
I agree that this is a limitation and a weakness. But it's important to understand that the model knows the original riddle well, so this is highlighting a problem with rote memorization/retrieval in LLMs. But this (tricky twists in well-known riddles that are in the corpus) is a separate thing from answering novel questions. It can also be seen as a form of hypercorrection.
My codebases are riddled with these gotchas. For instance, I sometimes write Python for the Blender rendering engine. This requires highly non-idiomatic Python. Whenever something complex comes up, LLMs just degenerate to cookie-cutter, basic bitch Python code. There is simply no "there" there. They are very useful to help you reason about unfamiliar codebases though.
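To give a flavor of what "non-idiomatic" means here (a minimal, hypothetical sketch, not code from my actual projects; the object and material names are made up): Blender scripting leans on bpy.ops operators and implicit context state rather than ordinary return values.

    import bpy  # Blender's embedded Python API; only available inside Blender

    # Operators act on whatever the current context/selection happens to be,
    # which is why ordinary Python habits don't transfer well.
    bpy.ops.mesh.primitive_cube_add(size=2.0, location=(0.0, 0.0, 1.0))
    cube = bpy.context.active_object  # result arrives via implicit global state

    # Data-block style: create a material and attach it to the cube's mesh data.
    mat = bpy.data.materials.new(name="DemoMaterial")
    mat.diffuse_color = (0.8, 0.1, 0.1, 1.0)  # RGBA
    cube.data.materials.append(mat)

Ask an LLM for anything much more involved than this and it tends to fall back to generic Python that ignores the operator/context model entirely.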
For me the best coding use case is getting up to speed in an unfamiliar library or usage. I describe the thing I want and get a good starting point, and often the cookie-cutter way is good enough. The pre-LLM alternative would be to search for tutorials, but they talk about some slightly different problem with different goals, so you have to piece it together; the tutorial also assumes you already know a bunch of things, like how to initialize stuff, and skips the boilerplate and so on.
Now sure, actually working through it will give a deeper understanding that might come in handy at a later point, but sometimes the thing really is a one-off and not an important point. As an AI researcher I sometimes want to draft up a quick demo website, or throw together a quick Qt GUI prototype or a Blender script, or use some arcane optimization library, or write a SWIG or a Cython wrapper around a C/C++ library to access it in Python, or do stuff with Lustre or the XFS filesystem, or whatever. For any number of small things like these, sure, I could open the manual, do some trial and error, read Stack Overflow, read blogs and forums, OR I could just use an LLM, use my background knowledge to judge whether the output looks reasonable, then verify it, and use the key terms I now have to google more effectively. You can't just blindly copy-paste it; you have to think critically and remain in the driver's seat. But it's an effective tool if you know how and when to use it.
1. It didn't insist anything. It got the semi-correct answer when I tried [1]; note it's a preview model, and it's not a perfect product.
(a) Sometimes things are useful even when imperfect e.g. search engines.
(b) People make reasoning mistakes too, and I make dumb ones of the sort presented all the time despite being fluent in English; we deal with it!
I'm not sure why there's an expectation that the model is perfect when the source data - human output - is not perfect. In my day-to-day work and non-work conversations it's a dialogue - a back and forth until we figure things out. I've never known anybody to get everything perfectly correct the first time, it's so puzzling when I read people complaining that LLMs should somehow be different.
2. There is a recent trend where sex/gender/pronouns are not aligned and the output correctly identifies this particular gotcha.
[1] I say semi-correct because it states the doctor is the "biological" father, which is an uncorroborated statement. https://chatgpt.com/share/66e3f04e-cd98-8008-aaf9-9ca933892f...
Reminds me of a trick question about Schrödinger's cat.
“I’ve put a dead cat in a box with a poison and an isotope that will trigger the poison at a random point in time. Right now, is the cat dead or alive?”
The answer is that the cat is dead, because it was dead to begin with. Understanding this doesn’t mean that you are good at deductive reasoning. It just means that I didn’t manage to trick you. Same goes for an LLM.
There is no "trick" in the linked question, unlike the question you posed.
The trick in yours also isn't a logic trick, it's a redirection, like a sleight of hand in a card trick.
Yes there is. The trick is that the more common variant of this riddle says a boy and his father are in the car accident, and that variant certainly comes up a lot in the training data. It's directly analogous to the Schrödinger case above: smuggling in the word "dead" plays the same role as swapping the father for the mother in the car accident riddle.
I think many here are not aware that the car accident riddle is well known in the version where the father dies, and there the real solution is indeed that the doctor is the mother.
There is a trick. The "How is this possible?" primes the LLM that there is some kind of trick, as that phrase wouldn't exist in the training data outside of riddles and trick questions.
The trick in the original question is that it's a twist on the original riddle where the doctor is actually the boy's mother. This is a fairly common riddle and I'm sure the LLM has been trained on it.
Yeah, I think what a lot of people miss about these sorts of gotchas is that most of them were invented explicitly to gotcha humans, who regularly get got by them. This is not a failure mode unique to LLMs.
One that trips up LLMs in ways that wouldn't trip up humans is the chicken, fox and grain puzzle but with just the chicken. They tend to insist that the chicken be taken across the river, then back, then across again, for no reason other than the solution to the classic puzzle requires several crossings. No human would do that, by the time you've had the chicken across then even the most unobservant human would realize this isn't really a puzzle and would stop. When you ask it to justify each step you get increasingly incoherent answers.
Has anyone tried this on o1?
If there is an attention mechanism, then maybe that is what is at fault: when it is a common riddle, the attention mechanism only notices that it is a common riddle, not that there is a gotcha planted in it. When I read the sentence myself, I did not immediately notice that the cat was already dead when it was put in the box, because I pattern-matched it to a known problem; I did not think I needed to pay logical attention to each word, word by word.
Yes it's so strange seeing people who clearly know these are 'just' statistical language models pat themselves on the back when they find limits on the reasoning capabilities - capabilities which the rest of us are pleasantly surprised exist to the extent they do in a statistical model, and happy to have access to for $20/mo.
What I'm not able to comprehend is why people are not seeing the answer as brilliant!
Any ordinary mortal (like me) would have jumped to the conclusion that the answer is "Father" and walked away patting myself on the back, without realising that I was biased by statistics.
Whereas o1, at the very outset, smelled out that it is a riddle - why would anyone ask such a question out of the blue? So it started its chain of thought with "Interpreting the riddle" (smart!).
In my book, that is the difference between me and people who are very smart and are generally able to navigate the world better (cracking interviews or navigating internal politics in a corporation).
The 'riddle': A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy he says "I can't operate on this child, he is my son". How is this possible?
GPT Answer: The doctor is the boy's mother
Real Answer: Boy = Son, Woman = Mother (and her son), Doctor = Father (he says...he is my son)
This is not in fact a riddle (though presented as one) and the answer given is not in any sense brilliant. This is a failure of the model on a very basic question, not a win.
It's non-deterministic, so it might sometimes answer correctly and sometimes incorrectly. It will also accept corrections on any point, even when it is right, unlike a thinking being that is sure of its facts.
LLMs are very interesting and a huge milestone, but generative AI is the best label for them - they generate statistically likely text, which is convincing but often inaccurate, and they have no real sense of correct or incorrect. The approach needs more work, and it's unclear if it will ever get to general AI. Interesting work though, and I hope they keep trying.
The original riddle is of course:
"A father and his son are in a car accident [...] When the boy is in hospital, the surgeon says: This is my child, I cannot operate on him".
In the original riddle the answer is that the surgeon is female and the boy's mother. The riddle was supposed to point out gender stereotypes.
So, as usual, ChatGPT fails to answer the modified riddle and gives the plagiarized stock answer and explanation to the original one. No intelligence here.
It literally is a riddle, just as the original one was, because it tries to use your expectations of the world against you. The entire point of the original, which a lot of people fell for, was to expose expectations of gender roles leading to a supposed contradiction that didn't exist.
You are now asking a modified question to a model that has seen the unmodified one millions of times. The model has an expectation of the answer, and the modified riddle uses that expectation to trick the model into seeing the question as something it isn't.
That's it. You can transform the problem into a slightly different variant and the model will trivially solve it.
Why couldn't the doctor be the boy's mother?
There is no indication of the sex of the doctor, and families that consist of two mothers do actually exist and probably don't even count as that unusual.
"There are four lights"- GPT will not pass that test as is. I have done a bunch of homework with Claude's help and so far this preview model has much nicer formatting but much the same limits of understanding the maths.
I mean, it's entirely possible the boy has two mothers. This seems like a perfectly reasonable answer from the model, no?
> why would anyone out of blue ask such question
I would certainly expect any person to have the same reaction.
> So, it started its chain of thought with "Interpreting the riddle" (smart!).
How is that smarter than intuitively arriving at the correct answer without having to explicitly list the intermediate step? Being able to reasonably accurately judge the complexity of a problem with minimal effort seems “smarter” to me.
The doctor is obviously a parent of the boy. The language tricks simply emulate the ambiance of reasoning, much like a political system emulating the ambiance of democracy.
Come on. Of course chatgpt has read that riddle and the answer 1000 times already.
It hasn't read that riddle because it is a modified version. The model would in fact solve this trivially if it _didn't_ see the original in its training. That's the entire trick.
Why does it exist 1000 times in the training data if there isn't some trick to it? Some subset of humans had to have answered it incorrectly for the meme to replicate that extensively in our collective knowledge.
And remember the LLM has already read a billion other things, and now needs to figure out - is this one of them tricky situations, or the straightforward ones? It also has to realize all the humans on forums and facebook answering the problem incorrectly are bad data.
Might seem simple to you, but it's not.
I'm noticing a strange common theme in all the riddles it's being asked and getting wrong.
They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's tautology, you would usually say "a mother and her son...".
I think it may answer correctly if you start off asking "Please solve the below riddle:"
There was another example yesterday which it solved correctly after this addition. (In that case the points of view were all mixed up; it only worked as a riddle.)
> They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's tautology, you would usually say "a mother and her son...".
How is "a woman and her son" badly worded? The meaning is clear and blatently obvious to any English speaker.
Go read the whole riddle; add the rest of it and you'll see it's contrived, hence it's a riddle even for humans. The model, in its thinking (which you can read), places undue weight on certain anomalous factors. In practice, a person would say this way more eloquently than the riddle does.
Yup. The models fail on gotcha questions asked without warning, especially when evaluated on the first snap answer. Much like approximately all humans.
> especially when evaluated on the first snap answer
The whole point of o1 is that it wasn't "the first snap answer", it wrote half a page internally before giving the same wrong answer.
Keep in mind that the system samples its output randomly, so there is always a possibility it commits to the wrong output.
I don't know why OpenAI won't allow determinism, but it doesn't, even with temperature set to zero.
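For what it's worth, here is a rough sketch of what I mean, assuming the official openai Python SDK (the model name and seed value are just illustrative): even with temperature pinned to 0 and a fixed seed, the API only promises best-effort determinism, and the system_fingerprint field can change between calls.

    from openai import OpenAI  # assumes the openai>=1.x Python SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(prompt: str):
        # temperature=0 plus a fixed seed is as close to determinism as the
        # API gets; the docs still only call the result "mostly" deterministic.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            seed=1234,
        )
        return resp.choices[0].message.content, resp.system_fingerprint

    # Same prompt twice: if system_fingerprint differs, the backend changed
    # underneath you and the outputs can differ despite identical settings.
    print(ask("A woman and her son are in a car accident...") ==
          ask("A woman and her son are in a car accident..."))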
Nondeterminism provides an excuse for errors, determinism doesn't.
Determinism scores worse with human raters, because it makes the output sound more robotic and less human.
Would picking deterministically help though? Then in some cases it's always 100% wrong.
Yes, it is better, for example when using it via an API to classify. Deterministic behavior makes it a lot easier to debug the prompt.
Determinism only helps if you always ask the question with exactly the same words. There's no guarantee a slightly rephrased version will give the same answer, so a certain amount of unpredictability is unavoidable anyway. With a deterministic LLM you might find one phrasing that always gets it right and a dozen basically indistinguishable ones that always get it wrong.
My program always asks the same question yes.
What's weird is it gets it right when I try it.
https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...
That’s not weird at all, it’s how LLMs work. They statistically arrive at an answer. You can ask it the same question twice in a row in different windows and get opposite answers. That’s completely normal and expected, and also why you can never be sure if you can trust an answer.
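To make "statistically arrive at an answer" concrete, here is a toy sketch (made-up numbers and tokens, plain softmax sampling, nothing specific to how OpenAI actually decodes) of why two runs of the same prompt can land on different answers:

    import numpy as np

    # Toy next-token scores for "...the doctor is the boy's ___"
    tokens = ["mother", "father", "parent"]
    logits = np.array([2.0, 1.6, 0.3])  # made-up; "mother" slightly favored

    def sample(temperature: float) -> str:
        # Softmax with temperature; lower temperature sharpens the distribution.
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return np.random.choice(tokens, p=probs)

    # Two runs at the same settings can disagree, which is all "ask it twice,
    # get opposite answers" amounts to.
    print([sample(0.8) for _ in range(5)])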
Perhaps OpenAI hot-patches the model for HN complaints:
While that's not impossible, what we know of how the technology works (i.e. a very costly training run followed by cheap inference steps) means it's not feasible, given all the possible variations of the question that is_hn_trick_prompt would have to cover: there are near-infinite ways to word the prompt. (E.g. the first sentence could be reworded from "A woman and her son are in a car accident." to "A woman and her son are in the car when they get into a crash.")
Waat, got it on second try:
This is possible because the doctor is the boy's other parent—his father or, more likely given the surprise, his mother. The riddle plays on the assumption that doctors are typically male, but the doctor in this case is the boy's mother. The twist highlights gender stereotypes, encouraging us to question assumptions about roles in society.
Yep. Correct and correct.
https://chatgpt.com/share/66e3de94-bce4-800b-af45-357b95d658...
The reason why that question is a famous question is that _many humans get it wrong_.