Comment by foobarbecue
1 day ago
In your example, you didn't ask the LLM to do any logic. You asked it to translate your logic into code.
Asking an LLM to do logic would be saying something like: "I have a row of a million light switches. They all start off. I start at the beginning and flip on every fourth one. Then I flip on every eighth one, then sixteen, and all the powers of two until I'm over a million. Now I do the same for the powers of three, then four, then five, and so on. How many light switches are on at the end? Do not use any external coding tools for this; use your own reasoning."
Note that the prompt itself is intentionally ambiguous -- a human getting this question should say "I don't understand why you started with every fourth instead of every second. Are you skipping the first power of every series, or only when the base is two?"
When I asked GPT5 to do it, it didn't care about that; instead it complimented me on my "crisp statement of the problem," roughly described a similar problem, and gave a believable but incorrect answer of 270,961.
I then asked it to write Python code to simulate my question. It got the code correct and said, "If you run this, you’ll see it matches the 270,961 result I gave earlier." Except that was a hallucination.
Running the code actually produced 252,711.
I guess it went with 270,961 because it looked like the answer to some lexically similar problems in the training data.
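For concreteness, here's a minimal sketch of that kind of naive simulation under one reading of the prompt (every base starts at its square, and "flip" means toggle); I'm not claiming it's the code GPT5 wrote, and other readings change the loop bounds and the count it prints:

```python
# Naive toggle simulation, assuming every base k >= 2 starts at k**2
# and a base is finished once its next power exceeds a million.
N = 1_000_000
switches = [False] * (N + 1)  # index 0 unused

k = 2
while k * k <= N:
    p = k * k
    while p <= N:
        for i in range(p, N + 1, p):  # toggle every p-th switch
            switches[i] = not switches[i]
        p *= k
    k += 1

print(sum(switches))  # number of switches left on under this reading
```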
That may be the case, but how often do you have to provide these kinds of puzzles as programming tasks? And even then, by your own example, it did produce correct code, so it did "reason" through the problem statement!
I would posit that most randomly selected AGIs (people) you ask this of, with no tools allowed beyond pencil and paper, won't get any closer on average. That doesn't prove people don't use logic.
And if you allow Python shells for both, I think the randomly selected human would do worse on average.
And I also think e.g. Google's IMO model would be able to pass but I have no way to verify that.
At work, you can trust people to either get the right answer or tell you they may not have the right answer. If someone is not trustworthy, you don't work with them again. The experience of trying to work with something that is completely untrustworthy on all fronts is novel and entirely dissimilar to working with either people or tools.
People themselves don't know when they are wrong, and that is why high-functioning organizations have all sorts of guardrails in place. A trivial example: code reviews. Now, code reviews are multi-purpose, and their primary benefit is not just catching bugs, but they do catch bugs pretty often (there are actual studies about this).
So my experience in working with AI is actually much more similar to working with people, except I have to correct the AI much less frequently.
I always say, AI is technology that behaves like people, and so the trick to working with it effectively is to approach it like working with a colleague, with all their specific quirks and skillsets, rather than a tool.
That's true about people with pencils and paper. But this is a computer, and my point is that with LLMs, you have a computer program that doesn't compute or do logic -- and for some reason people act like it does.
I mean, this is a super basic Python task that I could write in 2 minutes. The naive implementation is a couple of basic for loops. I wouldn't hire a junior dev who couldn't write it.
My point here is that for someone who can program, it's easier and better to write this in Python than to write it in English and have an LLM translate it to Python (and LLMs themselves are incapable of logic or reasoning, so their only option is to translate it to real code).
I added this line to the end of your prompt so it knew it could clarify and didn't need an immediate answer:
"You can ask any clarifying questions first before starting on the problem."
Here's what it asked (Claude Opus, web search off so it didn't find your post):
> Looking at this problem, I want to make sure I understand the process correctly before diving into the solution. Let me clarify a few things:
> When you say "flip on every fourth one" for the powers of two, do you mean:
> - We flip switches at positions 4, 8, 12, 16, 20, 24, ... (every 4th position)
> - Then flip switches at positions 8, 16, 24, 32, 40, 48, ... (every 8th position)
> - And so on?
> When you move to "powers of three", do you mean:
> - Start with every 3rd switch (3, 6, 9, 12, ...)
> - Then every 9th switch (9, 18, 27, 36, ...)
> - Then every 27th switch, etc.?
> Or does "powers of three" mean we start with 3² = 9 as the first increment, similar to how we started with 2² = 4 for powers of two?
> When you say "then four, then five, and so on" - do you mean we continue this process for all integers k ≥ 2, using powers of k as our increments?
> The final state of each switch depends on whether it gets flipped an odd or even number of times total, so I want to make sure I understand the flipping pattern correctly before calculating the answer.
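Out of curiosity, here's a sketch (mine, not Claude's) that runs both of the readings it asks about, still assuming "flip" means toggle and that base 2 always starts at 4; which count is "right" still depends on what the prompt actually meant:

```python
# Compare the two readings Claude distinguishes: do bases 3, 4, 5, ...
# start at k itself, or at k**2 the way base 2 starts at 4?
# Assumptions: "flip" means toggle; a base stops once its power exceeds N.
N = 1_000_000

def count_on(start_exp_other_bases):
    switches = [False] * (N + 1)
    for k in range(2, N + 1):
        exp = 2 if k == 2 else start_exp_other_bases
        p = k ** exp
        if p > N:
            break  # no larger base contributes anything under this reading
        while p <= N:
            for i in range(p, N + 1, p):
                switches[i] = not switches[i]
            p *= k
    return sum(switches)

print("bases 3+ start at k:   ", count_on(1))
print("bases 3+ start at k**2:", count_on(2))
```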