Comment by pron
2 days ago
> For me, LLMs are just a computer interface you can program using natural language.
I wish they were, but they're not that yet because LLMs aren't very good at logical reasoning. So it's more like an attempt to program using natural language. Sometimes it does what you ask, sometimes not.
I think "programming" implies that the machine will always do what you tell it, whatever the language, or reliably fail and say it can't be done because the "program" is contradictory, lacks sufficient detail, or doesn't have the necessary permissions/technical capabilities. If it only sometimes does what you ask, then it's not quite programming yet.
> Now - I can offload the most boring task to LLM and spend my mental energy on the interesting stuff!
I wish that, too, were true, and maybe it will be someday soon. But if I need to manually review the agent's output, then it doesn't feel like I'm offloading much aside from the typing. All the same concentration and thought are still required, even for the boring things. If I could at least trust the agent to tell me whether it did a good job or is unsure of its work, that would be helpful, but we're not even there yet.
That's not to say the tools aren't useful, but they're not yet "programming in a natural language", and not yet something you can "offload" stuff to.
> ... LLMs aren't very good at logical reasoning.
I'm curious about what experiences led you to that conclusion. IME, LLMs are very good at the type of logical reasoning required for most programming tasks. E.g. I only have to say something like "find the entries with the lowest X and highest Y that have a common Z from these N lists / maps / tables / files / etc." and it spits out mostly correct code instantly. I then review it and for any involved logic, rely on tests (also AI-generated) for correctness, where I find myself reviewing and tweaking the test cases much more than the business logic.
But then I do all that for all code anyway, including my own. So just starting off with a fully-fleshed out chunk of code, which typically looks like what I'd pictured in my head, is a huge load off my cognitive shoulders.
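To make that concrete, for a prompt like that it typically hands back something in this shape (a rough sketch with made-up field names, not code from any real project):

    # Rough sketch of the kind of code an LLM returns for a prompt like
    # "find the entries with the lowest x and highest y that share a common z
    # across these N lists of records". Field names are hypothetical.
    def find_matches(*sources):
        # Group every entry by its z value, remembering which source it came from.
        by_z = {}
        for i, entries in enumerate(sources):
            for entry in entries:
                by_z.setdefault(entry["z"], []).append((i, entry))

        results = []
        for z, tagged in by_z.items():
            # Only keep z values that appear in every source.
            if len({i for i, _ in tagged}) == len(sources):
                entries = [e for _, e in tagged]
                results.append({
                    "z": z,
                    "lowest_x": min(entries, key=lambda e: e["x"]),
                    "highest_y": max(entries, key=lambda e: e["y"]),
                })
        return results

    # Toy usage with two lists:
    a = [{"x": 1, "y": 5, "z": "k"}, {"x": 3, "y": 9, "z": "m"}]
    b = [{"x": 2, "y": 7, "z": "k"}]
    print(find_matches(a, b))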
The experience was that I once asked an LLM to write a simple function and it produced something very wrong that nothing with good reasoning abilities should ever do. Of course, a drunk or very tired human could have made the same mistake, but they would at least have told me that they were impaired and unsure of their work.
I agree that most of the time it does most simple tasks mostly right, but that's not good enough to truly "offload" my mental effort. Again, I'm not saying it's not useful, but it's less like working with a junior developer and more like working with a junior developer who may or may not be drunk or tired and doesn't tell you.
But mostly my point is that LLMs seem to be worse at logical reasoning than at other things, such as generating prose or summarising a document. Of course, even then you can't trust them yet.
> But then I do all that for all code anyway, including my own
I don't, at least not constantly. I review other people's code only towards the very end of a project, and in between I trust that they tell me about any pertinent challenge or insight, precisely so that I can focus on other things unless they draw my attention to something I need to think about.
I still think that working with a coding assistant is interesting and even exciting, but the experience of not being able to trust anything, for me at least, is unlike working with another person or with a tool and doesn't yet allow me to focus on other things. Maybe with more practice I could learn to work with something I can't trust at all.
> working with a junior developer who may or may not be drunk or tired and doesn't tell you.
Bad news, friend.
Overall though, I think you're right. It's a lot like working with people. The things you might be missing are that you can get better at this with practice, and that once you are multiplexing multiple Claudes, you can become hyper efficient. These are things I'm looking into now.
Do I know these for a fact? Not yet. But, as with any tool, I don't expect the investment to pay off right away.
What was the simple function?
https://arstechnica.com/ai/2025/08/researchers-find-llms-are...
In your example, you didn't ask the LLM to do any logic. You asked it to translate your logic into code.
Asking an LLM to do logic would be saying something like: "I have a row of a million light switches. They all start off. I start at the beginning and flip on every fourth one. Then I flip on every eighth one, then sixteen, and all the powers of two until I'm over a million. Now I do the same for the powers of three, then four, then five, and so on. How many light switches are on at the end? Do not use any external coding tools for this; use your own reasoning."
Note that the prompt itself is intentionally ambiguous -- a human getting this question should say "I don't understand why you started with every fourth instead of every second. Are you skipping the first integer of every power series or just when the exponent is two?"
When I asked GPT-5 to do it, it didn't care about that; instead it complimented me on my "crisp statement of the problem," roughly described a similar problem, and gave a believable but incorrect answer: 270,961.
I then asked it to write Python code to simulate my question. It got the code correct and said, "If you run this, you'll see it matches the 270,961 result I gave earlier" - except that was a hallucination.
Running the code actually produced 252,711.
I guess it went with 270,961 because that was a lexically similar answer to some lexically similar problems in the training data.
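FWIW, the brute-force simulation is only a few lines; the catch is that you have to pick a reading of the ambiguous prompt before you can write it, and the count it prints changes with that choice. Here's one guess at a reading (not necessarily the one my run or GPT-5's code used):

    # One reading of the puzzle (a guess -- the prompt is genuinely ambiguous):
    # "flip on" sets a switch to on (no toggling), and each base b starts at b^2,
    # mirroring "every fourth" rather than "every second" for base 2.
    N = 1_000_000
    switches = [False] * (N + 1)

    base = 2
    while base * base <= N:
        power = base * base
        while power <= N:
            for i in range(power, N + 1, power):
                switches[i] = True   # use `not switches[i]` here for a toggle reading
            power *= base
        base += 1

    print(sum(switches))   # the answer under THIS reading; other readings differ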
That may be the case, but how often do you have to provide these kinds of puzzles as programming tasks? And even then, by your own example, it did produce correct code, so it did "reason" through the problem statement!
I would posit that most randomly selected AGIs (people) you ask this of, with no tools allowed beyond pencil and paper, won't get any closer on average. That doesn't prove people don't use logic.
And if Python shells were allowed for both, I think the randomly selected human would do worse on average.
And I also think e.g. Google's IMO model would be able to pass but I have no way to verify that.
I'm not who you're replying to, but I had a scenario where I needed to notice that a command had completed (exit code received), keep listening for any output that was still buffered, and only stop processing tokens after it had been quiet for a little bit.
Trying to get Claude to do this without introducing a deadlock and without exiting too early and leaving valuable output in the pipe was hellish.
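The shape I was after was roughly this (a rough, Unix-only sketch with a placeholder command, not the actual code):

    import select
    import subprocess
    import time

    QUIET_SECS = 0.5   # how long the pipe must stay quiet before we give up

    # Placeholder command; the real one kept trickling output after the exit code arrived.
    proc = subprocess.Popen(
        ["sh", "-c", "echo hello; sleep 0.2; echo world"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )

    exited = False
    last_output = time.monotonic()
    chunks = []

    while True:
        # Poll the pipe briefly instead of blocking, so we can notice both
        # the process exiting and the output going quiet.
        ready, _, _ = select.select([proc.stdout], [], [], 0.1)
        if ready:
            chunk = proc.stdout.read1(65536)
            if chunk:
                chunks.append(chunk)
                last_output = time.monotonic()
            elif exited:
                break  # EOF after exit: the pipe is fully drained
        if proc.poll() is not None:
            exited = True
        if exited and time.monotonic() - last_output > QUIET_SECS:
            break  # process gone and nothing new for a while

    print(b"".join(chunks).decode())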
It's very good at some kinds of reasoning and very bad at others. There's not much it's mediocre at.
> It's very good at some kinds of reasoning and very bad at others. There's not much it's mediocre at.
Yes, one of the terms for this phenomenon is "jagged frontier" from here:
https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the...
This is also the main reason people have such diverging experiences and the discussion around AI is so bimodal in nature.
You can view Claude Code as a non-deterministic compiler where you input English and get functioning code out the other end.
The non-determinism is not as much of a problem, because you are reading over the results and validating that what it created matches what you told it to do.
I'm not talking about vibe-coding here; I'm grabbing the steering wheel with both hands, because this car lets me go faster than if I were driving myself, but sometimes you have to steer or brake. And the analogy favors Claude Code here because you don't have to react in milliseconds while programming.
TL;DR: if you do the commit you are responsible for the code it contains.
Sure, and that may be valuable, but it's neither "programming" nor "offloading mental effort" (at least not much).
Some have compared it to working with a very junior programmer. I haven't done that in a long while, but when I did, it didn't really feel like I was "offloading" much, and I could still trust even the most junior programmer to tell me whether the job was done well or not (and about any difficulties they encountered or insights they'd learnt) much more than I can an agent, at least today.
Trust is something we have, for the most part, when we work with either other people or with tools. Working without (or with little) trust is something quite novel. Personally, I don't mind that an agent can't accomplish many tasks; I mind a great deal that I can't trust it to tell me whether it was able to do what I asked or not.
> it's neither "programming"
Sure it is.
The modern ecosystem is sadly full of APIs, like WPF on Windows, that are both verbose and configuration-heavy. Now, some people may be able to internalize XAML with little effort, but not all of us - and then you basically move forward iteratively, looking for code examples, trying this or that … basically random-walking towards something usable.
Or you use an agentic LLM and it does this peeking and poking for you, and with decades-old APIs like WPF it likely has enough context to do the thing you asked far more competently than you could train yourself to do in a few days.
Of course, if in the context of this example WPF were your main duty, you _would_ learn the ins and outs of it.
But in quite a few jobs a task like this may not in fact even be among your top ten duties, yet you still need to do it.
In these sorts of situations an LLM is really nice. At worst, it gives you a good first guess at how to do something.
If your context is something like radiology treatment then no, don't use an LLM!
But there are thousands of miserable, non-critical but necessary components in production for which an LLM is just fine.
There is probably a significant factor of what domain you work in, that might be part of the reason why you have different experiences.
I don't know your current domain, but stuff like Loom requires very complex reasoning capabilities, most of which lives outside the actual codebase itself. Business code on the other hand often has a more direct mapping from idea to code, so LLMs might show more promise there.
So here's a transcript of a recent Claude Code interaction I had, trying to build an MP3 organizer that relies on heuristics and fuzzy matching to make sense out of all the random MP3s I downloaded in the 90s. This all happened in-between prompts: I gave it a correct example and an incorrect example and asked it why it was giving the incorrect example.
● Let me trace through what happens with the "Jordan Rudess" example:
● Read(mp3_organizer_v2.py) ⎿ Read 15 lines (ctrl+r to expand)
● The inference priority looks correct - path_component_match is indeed the highest priority. But I think I see the issue now. Let me look at the actual logic flow in the main organize method:
● Read(mp3_organizer_v2.py) ⎿ Read 10 lines (ctrl+r to expand)
● The flow looks right. Let me examine the actual problem more carefully. Looking back at the _apply_path_component_matching method, I think there might be an issue with when the matching happens. Let me check the logic:
● Update Todos
  ⎿ Check _path_is_in_scope logic for correctness
  ⎿ Verify artist inference priority logic works with path_component_match
  ⎿ Test and fix any remaining issues with artist propagation
● I think I see the issue now. The problem is that the current code only looks for matches in files that already have file_id3 tags, but it needs to propagate those matches to all files in the directory. However, the current implementation is still only applying matches to files that have matching candidates, not ALL files. Let me check the exact logic:
● Read(mp3_organizer_v2.py) ⎿ Read 8 lines (ctrl+r to expand)
● Actually, looking at this more carefully, the logic should work. The matched_artists dictionary contains the matches found from ANY file that had a file_id3 tag matching a path component, and then it applies that match to ALL files in the file_infos list via the loop at line 737.
● Read(mp3_organizer_v2.py) ⎿ Read 5 lines (ctrl+r to expand)
● Ah! I think I see a potential issue. The file_rel_path is calculated as the directory path, not the individual file path. So for all files in the same directory, file_rel_path will be the same. This should actually work correctly.
It's doing a pretty complex chain of step-by-step reasoning here, looking in detail at the code. It ended up by printing out more debug info and having me re-run it again, then paste in the debug info, then add even more debug info and run it again. But it did eventually get the bug, which was non-trivial to identify (it was an aliasing problem where Claude was mutating state on a list and that list was shared with all the other files in the directory).
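If anyone's curious what that class of bug looks like, here's a minimal reconstruction (hypothetical names, not the actual organizer code):

    # Hypothetical reconstruction of the class of bug, not the actual organizer code:
    # every file in a directory ends up holding a reference to the SAME list, so
    # mutating the candidates for one file silently mutates them for all of them.
    class FileInfo:
        def __init__(self, name, artist_candidates):
            self.name = name
            self.artist_candidates = artist_candidates  # stores the reference, not a copy

    dir_candidates = ["Jordan Rudess"]
    file_infos = [FileInfo(f"track{i}.mp3", dir_candidates) for i in range(3)]

    # Intended to adjust only the first file...
    file_infos[0].artist_candidates.append("Unknown Artist")

    # ...but because the list is shared, every file sees the change.
    print(file_infos[2].artist_candidates)  # ['Jordan Rudess', 'Unknown Artist']

    # The fix is a per-file copy, e.g. FileInfo(name, list(dir_candidates)).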
Huh. Alternate explanation: there's a layer of indirection, drawing upon the unthinkable size of the source data, so rather than 'issue forth tokens as if there is a person answering a question', you've got 'issue forth tokens as if there is a person being challenged to talk about their process', something that's also in the training data but in different contexts.
I'm not sure statements of 'aha, I see it now!' are meaningful in this context. Surely this is just the em-dash of 'issue tokens to have the user react like you're thinking'?
I wonder if something else is going on. Perhaps Claude is using the LLM to identify the likely culprits within the codebase, sending the code around them to execute with an actual Python interpreter on their servers, feeding both the code and the result as the context window to another LLM query with a system prompt something like "What is this code doing, when it runs on this input and this output?", feeding the result of that back to the user, and then repeating as long as the overall bug remains unsolved. I've found that feedback is a very effective technique with LLMs: ask them to extract some data, test that data through out-of-band mechanisms, then feed the test results and the original context back into the LLM to explain its reasoning and why it got the result. The attention mechanisms in the transformer model function very well when they're prompted with specifics and asked to explain their reasoning.
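In sketch form, the loop I'm imagining is something like this, where ask_llm and run_check are hypothetical stand-ins for whatever Anthropic actually uses:

    # Hypothetical sketch of the extract -> verify -> feed-back loop described above.
    # `ask_llm` and `run_check` are caller-supplied stand-ins, not any real API.
    def debug_loop(code, bug_report, ask_llm, run_check, max_rounds=5):
        context = f"Code:\n{code}\n\nBug report:\n{bug_report}"
        for _ in range(max_rounds):
            # Ask the model for its best guess at the culprit plus a concrete check.
            hypothesis = ask_llm(
                "Name the most likely culprit and a check that would confirm it.",
                context,
            )
            # Verify out-of-band: actually run the check instead of trusting the model.
            passed, observed = run_check(hypothesis)
            if passed:
                return hypothesis
            # Feed the real observation back so the next round reasons over ground truth.
            context += f"\n\nProposed check: {hypothesis}\nObserved result: {observed}"
        return None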
Only an Anthropic engineer would know for sure. I'm pretty sure that it was making multiple queries on my behalf during the chat transcript - each "Read ... mp3_organizer_v2.py" is a separate network round-trip.