Comment by keeda
2 days ago
> ... LLMs aren't very good at logical reasoning.
I'm curious about what experiences led you to that conclusion. IME, LLMs are very good at the type of logical reasoning required for most programming tasks. E.g. I only have to say something like "find the entries with the lowest X and highest Y that have a common Z from these N lists / maps / tables / files / etc." and it spits out mostly correct code instantly. I then review it and, for any involved logic, rely on tests (also AI-generated) for correctness; I find myself reviewing and tweaking the test cases much more than the business logic.
But then I do all that for all code anyway, including my own. So just starting off with a fully-fleshed out chunk of code, which typically looks like what I'd pictured in my head, is a huge load off my cognitive shoulders.
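To make that first example concrete, the code that comes back for that kind of request tends to have roughly this shape (an illustrative sketch only; the data layout and the field names x, y, z are made up):

    from collections import defaultdict

    def common_z_extremes(sources):
        # sources: several collections of entries, each entry a dict with
        # "x", "y" and "z" keys (purely illustrative field names).
        per_source = []
        for entries in sources:
            grouped = defaultdict(list)
            for e in entries:
                grouped[e["z"]].append(e)
            per_source.append(grouped)

        # Keep only the z values that appear in every collection.
        common = set(per_source[0]) if per_source else set()
        for grouped in per_source[1:]:
            common &= set(grouped)

        # For each shared z, pick the entry with the lowest x and the one
        # with the highest y across all collections.
        result = {}
        for z in common:
            pooled = [e for grouped in per_source for e in grouped[z]]
            result[z] = {
                "lowest_x": min(pooled, key=lambda e: e["x"]),
                "highest_y": max(pooled, key=lambda e: e["y"]),
            }
        return result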
The experience was that I once asked an LLM to write a simple function and it produced something very wrong, the kind of mistake nothing with good reasoning abilities should ever make. Of course, a drunk or very tired human could have made the same mistake, but they would have at least told me that they were impaired and unsure of their work.
I agree that most of the time it does most simple tasks mostly right, but that's not good enough to truly "offload" my mental effort. Again, I'm not saying it's not useful, but rather than working with a junior developer, it's like working with a junior developer who may or may not be drunk or tired and doesn't tell you.
But mostly my point is that LLMs seem to be worse at logical reasoning than at the other things they do, such as generating prose or summarising a document. Of course, even then you can't trust them yet.
> But then I do all that for all code anyway, including my own
I don't, at least not constantly. I review other people's code only towards the very end of a project, and in between I trust them to tell me about any pertinent challenge or insight, precisely so that I can focus on other things unless they draw my attention to something I need to think about.
I still think that working with a coding assistant is interesting and even exciting, but the experience of not being able to trust anything, for me at least, is unlike working with another person or with a tool and doesn't yet allow me to focus on other things. Maybe with more practice I could learn to work with something I can't trust at all.
> working with a junior developer who may or may not be drunk or tired and doesn't tell you.
Bad news, friend.
Overall though, I think you're right. It's a lot like working with people. The things you might be missing are that you can get better at this with practice, and that once you are multiplexing multiple Claudes, you can become hyper efficient. These are things I'm looking into now.
Do I know these for a fact? Not yet. But, as with any tool, I'm sure the investment won't pay off right away.
What was the simple function?
I'm not sure what their simple function was, but as practice in LLM use I tried to get Claude to recreate C++ code implementing the algorithms in this paper, and it didn't go well. But I'll be the first to admit that I'm probably holding it wrong.
https://users.cs.duke.edu/~reif/paper/chen/graph/graph.pdf
Can't remember, but it was something very basic: a 10- to 15-line routine that a first-year student would write in 3 minutes if they knew the relevant API. The reason I asked the model in the first place was that I didn't know the API. If memory serves, the model inverted an if or a loop condition.
In your example, you didn't ask the LLM to do any logic. You asked it to translate your logic into code.
Asking an LLM to do logic would be saying something like: "I have a row of a million light switches. They all start off. I start at the beginning and flip on every fourth one. Then I flip on every eighth one, then sixteen, and all the powers of two until I'm over a million. Now I do the same for the powers of three, then four, then five, and so on. How many light switches are on at the end? Do not use any external coding tools for this; use your own reasoning."
Note that the prompt itself is intentionally ambiguous -- a human getting this question should say "I don't understand why you started with every fourth instead of every second. Are you skipping the first integer of every power series or just when the exponent is two?"
When I asked GPT-5 to do it, it didn't care about that; instead it complimented me on my "crisp statement of the problem," roughly described a similar problem, and gave a believable but incorrect answer: 270,961.
I then asked it to write Python code to simulate my question. It got the code correct and said, "If you run this, you'll see it matches the 270,961 result I gave earlier." Except that was a hallucination.
Running the code actually produced 252,711.
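For what it's worth, the whole simulation fits in a dozen lines of Python under one reading of the prompt (every power series skips its base, and "flip on" only ever turns a switch on); the other reading, where the series for three and up start at the base itself, turns on far more switches, which is exactly the ambiguity noted above:

    N = 1_000_000
    on = bytearray(N + 1)        # switches 1..N, all start off

    base = 2
    while base * base <= N:
        step = base * base       # this reading skips the base itself, as the prompt does for 2
        while step <= N:
            for i in range(step, N + 1, step):
                on[i] = 1        # "flip on": switches only ever get turned on
            step *= base
        base += 1

    print(sum(on))               # the total depends entirely on which reading you pick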
I guess it went with 270,961 because it was lexically close to the answers of lexically similar problems in the training data.
That may be the case, but how often do you have to provide these kinds of puzzles as programming tasks? And even then, by your own example, it did produce correct code, so it did "reason" through the problem statement!
The point I'm making is that translating from English to Python does not involve logical reasoning or any form of problem solving. You essentially replace English phrases with their Python equivalents; it's a fuzzy version of looking things up in a dictionary.
It can translate, but it cannot reason or calculate. The Python code can.
I would posit that most randomly selected AGIs (people) you ask this of, with no tools beyond pencil and paper, won't get closer on average. That doesn't prove people don't use logic.
And if you allow Python shells for both, I think the randomly selected human would do worse on average.
And I also think that, e.g., Google's IMO model would be able to pass, but I have no way to verify that.
At work, you can trust people to either get the right answer or tell you they may not have the right answer. If someone is not trustworthy, you don't work with them again. The experience of trying to work with something that is completely untrustworthy on all fronts is novel and entirely dissimilar to working with either people or tools.
That's true about people with pencils and paper. But this is a computer, and my point is that with LLMs, you have a computer program that doesn't compute or do logic -- and for some reason people act like it does.
I mean, this is a super basic Python task that I could write in 2 minutes. The naive implementation is a couple of basic for loops. I wouldn't hire a junior dev who couldn't write it.
My point here is that for someone who can program, it's easier and better to write this in Python than to write it in English and have an LLM translate it to Python (and LLMs themselves are incapable of logic or reasoning, so the only option is for them to translate it to real code).
https://arstechnica.com/ai/2025/08/researchers-find-llms-are...
I'm not who you're replying to, but I had a scenario where I needed to notice that a command had completed (exit code received), keep listening for any output that was still buffered, and only stop processing tokens after things had been quiet for a little bit.
Trying to get Claude to do this without introducing a deadlock and without exiting too early and leaving valuable output in the pipe was hellish.
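For what it's worth, the shape I was after is roughly this (a hedged sketch rather than the code Claude and I ended up with; the function name and the quiet-period value are made up):

    import queue
    import subprocess
    import threading

    def run_and_drain(cmd, quiet_period=0.5):
        # Run cmd; after the exit code arrives, keep reading until the output
        # has been quiet for quiet_period seconds or the pipe reaches EOF.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        lines = queue.Queue()

        def reader():
            # A dedicated reader thread means the main loop never blocks on a
            # pipe that might stay silent (the deadlock mentioned above).
            for line in proc.stdout:
                lines.put(line)
            lines.put(None)      # EOF sentinel

        threading.Thread(target=reader, daemon=True).start()

        collected = []
        exited = False
        while True:
            if not exited and proc.poll() is not None:
                exited = True    # exit code received; the pipe may still hold buffered output
            try:
                line = lines.get(timeout=quiet_period)
            except queue.Empty:
                if exited:
                    break        # process is gone and it has been quiet long enough
                continue         # still running, just quiet for the moment
            if line is None:
                break            # reader hit EOF; nothing more is coming
            collected.append(line)

        return proc.wait(), collected

The point of that layout is that exit detection and output consumption run in separate loops, so neither can starve or deadlock the other.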
It's very good at some kinds of reasoning and very bad at others. There's not much it's mediocre at.
> It's very good at some kinds of reasoning and very bad at others. There's not much it's mediocre at.
Yes, one of the terms for this phenomenon is "jagged frontier" from here:
https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the...
This is also the main reason people have such divergent experiences and why the discussion around AI is so bimodal.