Comment by xnorswap
1 day ago
> it can't find actual flaws in your code
I can tell from this statement that you don't have experience with claude-code.
It might just be a "text predictor" but in the real world it can take a messy log file, and from that navigate and fix issues in source.
It can appear to reason about root causes and issues with sequencing and logic.
That might not be what is actually happening at a technical level, but it is indistinguishable from actual reasoning, and produces real world fixes.
> I can tell from this statement that you don't have experience with claude-code.
I happen to use it on a daily basis. 4.6-opus-high to be specific.
The other day it surmised from (I assume) the contents of my clipboard that I want to do A, while I really wanted to B, it's just that A was a more typical use case. Or actually: hardly anyone ever does B, as it's a weird thing to do, but I needed to do it anyway.
> but it is indistinguishable from actual reasoning
I can distinguish it pretty well when it makes mistakes someone who actually read the code and understood it wouldn't make.
Mind you: it's great at presenting someone else's knowledge and it was trained on a vast library of it, but it clearly doesn't think itself.
What do you mean the content of your clipboard?
I either accidentally pasted it somewhere and removed, forgetting about doing that or it's reading the clipboard.
The suggestion it gave me started with the contents of the clipboard and expanded to scenario A.
5 replies →
What you're describing is not finding flaws in code. It's summarizing, which current models are known to be relatively good at.
It is true that models can happen to produce a sound reasoning process. This is probabilistic however (moreso than humans, anyway).
There is no known sampling method that can guarantee a deterministic result without significantly quashing the output space (excluding most correct solutions).
I believe we'll see a different landscape of benefits and drawbacks as diffusion language models begin to emerge, and as even more architectures are invented and practiced.
I have a tentative belief that diffusion language models may be easier to make deterministic without quashing nearly as much expressivity.
This all sounds like the stochastic parrot fallacy. Total determinism is not the goal, and it not a prerequisite for general intelligence. As you allude to above, humans are also not fully deterministic. I don't see what hard theoretical barriers you've presented toward AGI or future ASI.
Did you just invent a nonsense fallacy to use as a bludgeon here? “Stochastic parrot fallacy” does not exist, and there actually quite a bit of evidence supporting the stochastic parrot hypothesis.
1 reply →
I haven't heard the stochastic parrot fallacy (though I have heard the phrase before). I also don't believe there are hard theoretical barriers. All I believe is that what we have right now is not enough yet. (I also believe autoregressive models may not be capable of AGI.)
> moreso than humans
Citation needed.
Much of the space of artificial intelligence is based on a goal of a general reasoning machine comparable to the reasoning of a human. There are many subfields that are less concerned with this, but in practice, artificial intelligence is perceived to have that goal.
I am sure the output of current frontier models is convincing enough to outperform the appearance of humans to some. There is still an ongoing outcry from when GPT-4o was discontinued from users who had built a romantic relationship with their access to it. However I am not convinced that language models have actually reached the reliability of human reasoning.
Even a dumb person can be consistent in their beliefs, and apply them consistently. Language models strictly cannot. You can prompt them to maintain consistency according to some instructions, but you never quite have any guarantee. You have far less of a guarantee than you could have instead with a human with those beliefs, or even a human with those instructions.
I don't have citations for the objective reliability of human reasoning. There are statistics about unreliability of human reasoning, and also statistics about unreliability of language models that far exceed them. But those are both subjective in many cases, and success or failure rates are actually no indication of reliability whatsoever anyway.
On top of that, every human is different, so it's difficult to make general statements. I only know from my work circles and friend circles that most of the people I keep around outperform language models in consistency and reliability. Of course that doesn't mean every human or even most humans meet that bar, but it does mean human-level reasoning includes them, which raises the bar that models would have to meet. (I can't quantify this, though.)
There is a saying about fully autonomous self driving vehicles that goes a little something like: they don't just have to outperform the worst drivers; they have to outperform the best drivers, for it to be worth it. Many fully autonomous crashes are because the autonomous system screwed up in a way that a human would not. An autonomous system typically lacks the creativity and ingenuity of a human driver.
Though they can already be more reliable in some situations, we're still far from a world where autonomous driving can take liability for collisions, and that's because they're not nearly as reliable or intelligent enough to entirely displace the need for human attention and intervention. I believe Waymo is the closest we've gotten and even they have remote safety operators.
6 replies →
Nothing you've said about reasoning here is exclusive to LLMs. Human reasoning is also never guaranteed to be deterministic, excluding most correct solutions. As OP says, they may not be reasoning under the hood but if the effect is the same as a tool, does it matter?
I'm not sure if I'm up to date on the latest diffusion work, but I'm genuinely curious how you see them potentially making LLMs more deterministic? These models usually work by sampling too, and it seems like the transformer architecture is better suited to longer context problems than diffusion
The way I imagine greedy sampling for autoregressive language models is guaranteeing a deterministic result at each position individually. The way I'd imagine it for diffusion language models is guaranteeing a deterministic result for the entire response as a whole. I see diffusion models potentially being more promising because the unit of determinism would be larger, preserving expressivity within that unit. Additionally, diffusion language models iterate multiple times over their full response, whereas autoregressive language models get one shot at each token, and before there's even any picture of the full response. We'll have to see what impact this has in practice; I'm only cautiously optimistic.
2 replies →