
Comment by jmilloy

7 days ago

Did you look at the examples? There's a big difference between "if I have 4 apples and two cats, and I give away 1 apple, how many apples do I have?", which is one kind of irrelevant information that at least appears applicable, and "if I have four apples and give away one apple, how many apples do I have? Also, did you know cats use their tails to help balance?", which really wouldn't confuse most humans.

> which really wouldn't confuse most humans

And I think it would. I think a lot of people would ask the invigilator whether something is wrong with the test, or maybe answer both questions, or write a short answer to the cat question too, or get confused and give up.

That is the kind of question where, if it were put on a test, I would expect kids to start squirming and looking at each other and the teacher right as they reach it.

I’m not sure how big this effect is, but it would be very surprising if there were no effect and unsuspecting, unwarned people performed the same on the “normal” and the “distractions” tests. Especially if the extra information is phrased as a question, like in your example.

I've heard from teachers that students get distracted when irrelevant details are added to word problems. This is obviously anecdotal, but the teachers I chatted with about this thought it's because people are trained through their whole education that every element of a word problem must be used. So when extra bits are added, people's minds desperately try to use them.

But the point is not that I'm right. Maybe I'm totally wrong. The point is that if the paper wants to state it as a fact one way or the other, they should have performed an experiment. Or cited prior research. Or avoided stating an unsubstantiated opinion about human behaviour and stuck to describing the AI.

  • Yeah you're right, if that human is 5 years old or has crippling ADHD.

    • Not at all. There are cultural expectations within each field of what kind of questions students expect to be on a test. If those expectations are violated by the test, students will reasonably be distracted, second-guess themselves, etc.

    • You can argue until the cows come home. The point is that they claim, without evidence, that humans are not susceptible to this kind of distraction.

      If they want to establish this as a fact, there is a trivially easy experiment they can conduct.

      “Someone on Hacker News strongly feels it is true, and is willing to argue the case with witty comments” is not how scientific knowledge is established. We either have done the experiments and have the data, or we don’t.

      1 reply →

    • You think too highly of humans.

      Humans are not reliable. For every "no human would make this kind of mistake", you can find dozens to hundreds of thousands of instances of humans making this kind of mistake.

      5 replies →

  • An LLM’s source of “knowledge” is almost purely statistical. The prompt injections create statistical noise that makes the token search a crapshoot. My guess is there are certain words and phrases that generate and amplify this statistical noise.
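
    This is measurable, at least in principle. As a rough sketch (not anything from the paper; gpt2 is just a stand-in model and the prompts are invented), one could compare the log-probability a local HuggingFace causal LM assigns to the correct answer with and without an injected distractor sentence:

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2")

      def answer_logprob(prompt: str, answer: str) -> float:
          # Log-probability the model assigns to `answer` following `prompt`.
          ids = tok(prompt + answer, return_tensors="pt").input_ids
          n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
          with torch.no_grad():
              logprobs = torch.log_softmax(model(ids).logits, dim=-1)
          # Sum the scores of the answer tokens, each conditioned on
          # everything before it.
          return sum(logprobs[0, i - 1, ids[0, i]].item()
                     for i in range(n_prompt, ids.shape[1]))

      base = "Q: I have 4 apples and give away 1. How many apples do I have? A:"
      noisy = ("Q: I have 4 apples and give away 1. Cats use their tails "
               "to balance. How many apples do I have? A:")
      # If the distractor shifts probability mass away from the right answer,
      # the second number should be noticeably lower.
      print(answer_logprob(base, " 3"), answer_logprob(noisy, " 3"))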

  • I wonder if there's variation at play here in testing culture, whether spatially or temporally or both.

As someone who has written and graded a lot of university exams, I'm sure a decent number of students would write the wrong answer to that. A bunch of students would write 5 (adding all the numbers). Others would write "3 apples and 2 cats", which is technically not what I'm looking for (but which I personally would give full marks for; some wouldn't).

Many students clearly try to answer exams by pattern matching, and I've seen a lot of exams where students "matched" on a pattern based on one word in a question and did something totally wrong.

  • Many professionals in lower-skilled jobs sometimes lean too heavily on pattern matching too.

    For example, customer service reps often match your request to a templated response that is only vaguely applicable, or not applicable at all.

    Technically savvy customers who try to explain their problem in detail are probably more likely to get a non-applicable canned response: the CS rep gets frustrated with the amount of information and latches onto the first phrase that maps to a templated response, without really considering context.

    My reply’s getting a little tangential now, but I feel this is good life advice: I’ve found I’m more likely to get decent customer service if I keep my requests as short as possible.

    The first sentence needs to state the issue I need help with. In some cases a bulleted list of things I’ve tried helps, and then I make sure to include essential info like an account number. E.g.:

    I’m getting error 13508 when I try to log into my account. I’ve already tried the following solutions, with no success:

    - Clearing my browser cache and cookies.

    - Restarting my computer.

    - Running all software updates.

    My account number: xxx

    What is the next step here?

    • > What is the next step here?

      The next step will be to walk you through clearing your browser cache and cookies.

      Because the CS rep has no idea who you are, and your protestations of competency fall on deaf ears, because they've dealt with 23325424 people in the last year who claimed to know what they were doing but actually didn't at all.

      Their goal is to get through the script, because getting through the script is the only way to be sure that it's all been done the way it needs to be done. And if they don't run through the script, and refer you to the next level of support, and it turns out that you hadn't actually cleared your browser cache and cookies, then that's their fault and they get dinged for it.

      I always approach these situations with this understanding: the quickest way to get my problem solved is to help them work through their script. And every now and then, just occasionally, working through the script has turned up something simple and obvious that I'd totally missed despite my decades of experience.

      1 reply →

  • Parent's whole point is contrary to this (they agree with you): the context didn't even include numbers to pattern match on!

    • Sorry, I failed at pattern matching myself :)

      However, I still think any irrelevant facts would upset a number of exam takers, and claiming it "clearly" wouldn't is far too strong a claim to make without evidence.

  • When you try to wing your way through a question by pattern matching, you are not applying intelligence. Your interests lie elsewhere, so you are just fumbling through the activity at hand to get it over with.

    • This is something that the rise of LLMs has highlighted for me. Sometimes, we don't care to apply our intelligence to a problem. I've come to think of myself as "acting like an LLM" when I do this.

      It reminds me of Kahneman's "system 1" (fast) and "system 2" (slow) thinking. LLMs are system 1 - fast, intuitive, instinctual. Humans often think that way. But we can also break out system 2 when we choose to, and apply logic, reason, etc.

      1 reply →

  • I agree that poor test takers are easily distracted, and this is the reason that "word problems" are heavily emphasized in preparation for tests like the SAT or state proficiency exams.

    But in general I do not think these models claim to be good at replicating the performance of a distracted or otherwise low-performing pupil. I think they should be evaluated against humans who are capable of completing word problems containing context that is not strictly necessary to the math question. The reason the tests I mentioned use these word problems is that they are a way to evaluate someone's ability to think in abstract mathematical terms about everyday situations, which obviously involve lots of unimportant information the person must choose to consider or not.

    tl;dr: I think a reasonably competent high school student could answer the apple-and-cat question, which is absolutely a reasonable bar for an LLM to clear. If university students are failing these questions, then they have not been taught test-taking skills, which should be considered a mathematical failure just as unacceptable as the LLM's, not a mitigating similarity for the latter.

Yes, especially interview questions that include a stupid "real life example" that is usually irrelevant to the question.

If asked verbally, that would absolutely confuse some humans. It could easily triple the error rate for that specific question (granted, it's easier than the actual questions, but still). Even in a written test with time pressure, it would probably still have a statistically significant effect.

  • The problem with your reasoning is that some humans cannot solve the problem even without the irrelevant info about cats.

    We can easily cherry pick our humans to fit any hypothesis about humans, because there are dumb humans.

    The issue is that AI models which, on the surface, appear to be similar to the smarter quantile of humans in solving certain problems, become confused in ways that humans in that problem-solving class would not be.

    That's obviously because the language model is not generally intelligent; it's just retrieving tokens from a high-dimensional, statistically fit function. The extra info injects noise into the calculation, which confounds it.

    • > We can easily cherry pick our humans to fit any hypothesis about humans, because there are dumb humans.

      Nah. You would take a large number of humans, have half of them take the test with distracting statements and half without, and then compare their results statistically. Yes, there would be some dumb ones, but as long as you test enough people they would show up in both samples at roughly the same rate. (A sketch of how that comparison could be run is at the end of this comment.)

      > become confused in ways that humans in that problem-solving class would not be.

      You just state the same thing others are disputing. Do you think it will suddenly become convincing if you write it down a few more times?
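
      To make the proposed experiment concrete, here is a minimal sketch of how the analysis could go. The counts are invented, and statsmodels' two-sample proportion z-test is just one reasonable choice of test:

        from statsmodels.stats.proportion import proportions_ztest

        # Hypothetical results: 500 people per group, all counts made up.
        control_correct, control_n = 412, 500        # plain word problems
        distracted_correct, distracted_n = 371, 500  # with irrelevant cat facts

        # Two-sample z-test: is the accuracy gap real or sampling noise?
        stat, p_value = proportions_ztest(
            count=[control_correct, distracted_correct],
            nobs=[control_n, distracted_n],
        )
        print(f"z = {stat:.2f}, p = {p_value:.4f}")
        # A small p-value would mean the distractions measurably hurt human
        # scores; a large one would support the paper's claim.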

    • That's obviously because the brain is not generally intelligent; it's just retrieving concepts from a high-dimensional, statistically fit function. The extra info injects noise into the calculation, which confounds it.

      7 replies →

  • Is the model thinking "what is the cat doing here?" and then starting to think it is being tested?

    • Even if the model "ignores" it, won't the presence of the irrelevant text alter the probability of its output in some way?

    • I have no clue what the model is thinking, and as far as I can tell the paper also makes no attempt at answering that. It's also not really the point; the point is that the paper's claim that humans would be unaffected is unsubstantiated and highly suspect. I'd even say it's more likely wrong than right.

      3 replies →

    • I wonder if the problem here is simply hitting some internal quota on compute resources. Like, if you send the model on a wild goose chase with irrelevant information, it wastes enough compute on the chase that it fails to arrive at the correct answer to the main question.

      1 reply →

"wouldn't confuse most humans", yes but no first presumption is that we are talking about humans doing math, in some sort of internet setting. second presumption is that this human has been effected by the significant percentage of the internet devoted to cats and that there response is going to be likely frustration and outrage at cats invading math, or massive relief in having cat meems worked into something otherwise tedious and then the third presumption is that a large number of "humans" wont be aware of the cats in math thing, because they imediatly offloaded the task to an LLM

It absolutely would once you start hitting working memory constraints. And at the margins, some people who would otherwise be 50:50 on a given math problem will hit those constraints.

Any kind of distraction is likely to impact human test scores, unless the test is well below their level or they're otherwise very comfortable with the subject matter. Math specifically makes most of the general public feel a bit in over their head, so tossing random cat facts into the mix is going to get people more confused and nervous.

Maybe I'm totally wrong about that, but they really should have tested humans too; without that context, this result seems lacking.