Comment by rainsford
14 hours ago
I have generally moved from bearish to bullish on the future of current AI technology, but the continued inaccuracy with basic facts, even as the models significantly improve, still gives me significant pause.
As an example, creating recipes with Claude Opus based on flavor profiles and preferences feels magical, right up until the point at which it can't accurately convert between tablespoons and teaspoons. It's like the point in the movie where a character is acting nearly right, but something is a bit off, and then it turns out they're a zombie about to eat your brain. This note-taking example feels similar. It nearly works in some pretty impressive ways, and then fails at the important details in a way that something able to do the things AI can allegedly do really shouldn't.
It's these failures that make me more and more convinced that while current generation AI can do some pretty cool things if you manage it right, we're not actually on the right track to achieve real intelligence. The persistence of these incredibly basic failure modes even as models advance makes it fairly obvious that continued advancement isn't going to actually address those problems.
Yup, spot on. There's a capability-reliability gap that the industry does not like to talk about too much.
It often feels like the AI industry is continually glossing over the fact that capability and reliability are fundamentally different qualities. We tend to use "accurate" and "reliable" interchangeably, but they describe different things. A model can ace a benchmark (capability/accuracy) and still be a liability in production (reliability).
Just look at recent reactions to yet another release from METR showing improved capabilities. But the less talked about part is that their headline measure is the task time horizon at a 50% success rate (and the even less talked about secondary measure at an 80% success rate has a drastically shorter time horizon). https://metr.org/
I implement AI systems for enterprises and I don't know any that would ever be okay with 80% reliability (let alone 50%).
This capability-reliability gap (excellent term btw, more people need to think in these terms or we'll be in real trouble) is also infecting LLM-assisted outputs. I just tried VSCode again tonight after a ~3yr hiatus and goddamn has it deteriorated. Lots of new features, lots of interesting-looking plugins, but 3 out of the 5 plugins I tried for code CAD (the reason I downloaded VSCode again at all) were completely unusable--they couldn't even be made to work at all--and the other two didn't do anything like what they claimed. Also, VSCode itself got into some kind of spastic loop trying to log me into GitHub, and seemed incapable of recognizing the virtual environment in a Python project's workspace... It also feels like the UI got even slower. This situation is bad.
Not my term! Some real academics came up with it: https://www.normaltech.ai/p/new-paper-towards-a-science-of-a...
Your analogy reminds me of the messed-up fingers and hands in image generation models just a year ago. Now that is pretty much solved. These days they are generating videos you can't tell apart from reality. This makes me believe these nuances will keep shrinking and eventually become very hard to notice and find, in maybe every task.
I would suggest slightly adjusting your expectations by factoring in the difference between video training data and text training data. Due to computation and cost limitations, video training data is far less polluted with AI video slop. Also, humans don't generate a lot of biology- and physics-defying fictional video, relative to how abundant and easy to produce real-life video is.
The main problem currently with LLM text is not that they create incoherent sentences; it's that what they purport to be statements of fact or general consensus often isn't, because they are bullshit machines that become better and more accurate bullshitters the more context-accurate data they are fed. AI videos may still have issues with "looking plausible", whereas LLM text currently has fewer issues with "sounding plausible" and more issues with "being correct" with respect to reality, to which it has no direct connection.
No one is penalizing an AI video generator for creating a scene that never happened in real life.
Yesterday I was using Opus 4.6 through Copilot (don't ask...) to rubber-duck-brainstorm a big feature that needs a lot of care.
I got some inspiration from it, but it misinterpreted very basic stuff. Might be a skill issue on my side, I do not know.
I hate to help provide possible solutions to an entire process I don't approve of, but maybe the fuzzy tools need old-style deterministic tools the same way, and for the same reasons, we do.
So instead of an LLM trying to answer a math or reasoning question by finding a statistical match with other similar groups of words it found on 4chan, the All-In podcast, and a terrible recipe for soup written by a terrible cook, it can use a calculator when it needs a calculator answer.
They absolutely need deterministic tools. What you just described is exactly how the current popular AI agents work. They use "harnesses", which to me is just a rebranding of what we have known all along about building useful and reliable software: composable, orchestrated systems, with a variety of different pieces selected based on their capabilities and constraints, glued together for specific outcomes.
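To make that concrete, here's a minimal sketch of the harness pattern in Python (the JSON protocol and the convert_volume tool are made up for illustration, not any particular vendor's API): the model emits a structured tool call, and plain deterministic code does the arithmetic instead of the model guessing at it.

    # Minimal "harness" sketch: the LLM proposes a tool call,
    # deterministic code does the actual math.
    import json

    def convert_volume(amount: float, from_unit: str, to_unit: str) -> float:
        """Deterministic unit conversion -- no statistical guessing."""
        teaspoons_per = {"tsp": 1.0, "tbsp": 3.0, "cup": 48.0}
        return amount * teaspoons_per[from_unit] / teaspoons_per[to_unit]

    TOOLS = {"convert_volume": convert_volume}

    def run_turn(model_output: str) -> str:
        """If the model asked for a tool, execute it and return the result."""
        msg = json.loads(model_output)
        if msg.get("type") == "tool_call":
            result = TOOLS[msg["name"]](**msg["arguments"])
            return json.dumps({"type": "tool_result", "value": result})
        return model_output  # plain text answer, pass it through

    # e.g. the model emits a tool call instead of guessing "2 tbsp = ? tsp":
    call = ('{"type": "tool_call", "name": "convert_volume", '
            '"arguments": {"amount": 2, "from_unit": "tbsp", "to_unit": "tsp"}}')
    print(run_turn(call))  # {"type": "tool_result", "value": 6.0}

The tablespoon/teaspoon failure upthread disappears entirely once the conversion goes through the deterministic path; the only job left for the model is deciding when to call the tool.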
It just feels like for some reason this is all being relearned with LLMs. I guess shortcuts have always been tempting. And the idea of a "digital panacea" is too hard to resist.
I think that is how the smarter agents do things? Just like Claude/ChatGPT sometimes does a web search they can do other tool calls instead of just making a statistical guess. Of course it doesn’t always make the bright choice between those options though…
They will also lie and produce output saying it is based on tool execution, without having actually used the tool.
Yes, another layer that cross-checks, say, “in kubectl logs I see …” against an actual k8s tool call can help, that is, when the cross-check layer doesn't lie itself.
For the time being, IMHO, human validation at key points is the only way to get good results. This is why the tools make experienced people potentially a lot more efficient (they are quick to spot errors/BS) and inexperienced people potentially more dangerous (they're more prone to trusting the responses, since the tone usually sounds very professional).
> it doesn’t always make the bright choice
I'm available for a small fee.
Doesn't agentic AI do this? I've got AI running in VS Code. If I ask it for something, it can fill a code cell with a little bit of Python, and then run it with my approval. It's using the Python interpreter on my computer as a calculator.
That’s exactly how all the current cloud chat bots and agents work now.
No, they just need to be trained to have adversarial self review "thinking" processes.
You ask an LLM "What's wrong with your answer?" and you get pretty good results.
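As a rough sketch of that pattern (assuming an OpenAI-style chat client; the model name and the prompts are placeholders, not a recommendation), the adversarial pass is just a second call over the first answer:

    # Sketch of an adversarial self-review pass: a second call critiques the first.
    # Assumes the OpenAI Python client; the model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()

    def answer_with_review(question: str, model: str = "gpt-4o") -> str:
        draft = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content

        # Ask the model to attack its own draft, then produce a corrected answer.
        reviewed = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user", "content": question},
                {"role": "assistant", "content": draft},
                {"role": "user", "content": "What's wrong with your answer? "
                 "List concrete errors, then give a corrected answer."},
            ],
        ).choices[0].message.content
        return reviewed

Of course, whether the second pass actually improves the answer is the contested part.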
Or the original output was perfect and the adversarial "rethinking" switches it to an incorrect result.
> we're not actually on the right track to achieve real intelligence.
Real intelligence means you have to say "I don't know" when you don't know, or ask for help, or even just refuse to help, with the subtext being that you don't want to appear stupid.
The models could ostensibly do this when they have low confidence in their own results, but they don't. What I don't know is whether that's because it would be very computationally difficult or because it would harm the reputation of the companies charging a good sum to use them.
> Real intelligence means you have to say "I don't know" when you don't know
I have met many supposedly intelligent, certainly high status, humans who don't appear to be able to do that either.
I have more confidence we can train AIs to do it, honestly.
While it is true that there are people who do not admit they are wrong when they factually are, your assertion glosses over the fact that most of the people we keep in our social circle are people we have learned, through our experiences with them, to trust to be honest.
That's just not how they work, really. They don't know what they don't know and their process requires an output.
I think they're getting better at it, but it's likely just the number of parameters getting bigger and bigger in the SOTA models more than anything.
They do know what they don't know. There's a probability distribution for outputs that they are sampling from. That just isn't being used for that purpose.
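For what it's worth, the raw signal is easy to pull out of an open model; here's a rough sketch with a small Hugging Face causal LM (the model choice, and treating per-token probabilities as a "confidence" proxy, are my own simplifications, not a calibrated method):

    # Sketch: the model exposes a probability for every token it emits;
    # it's just not surfaced as "I don't know". gpt2 is a placeholder model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    prompt = "How many teaspoons are in a tablespoon? Answer:"
    inputs = tok(prompt, return_tensors="pt")

    out = model.generate(
        **inputs,
        max_new_tokens=5,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
        pad_token_id=tok.eos_token_id,
    )

    # Probability the model assigned to each token it actually produced.
    new_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    probs = [torch.softmax(score, dim=-1)[0, tok_id].item()
             for score, tok_id in zip(out.scores, new_tokens)]
    print(tok.decode(new_tokens), [round(p, 3) for p in probs])

Turning those numbers into a trustworthy "I don't know" is the hard, unsolved part; the distribution exists, but it isn't calibrated for that.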
My theory is that the people building the models, and in charge of directing where they go, love the sycophantic yes-man behavior the models display.
They don't like hearing "I don't know"
You can TELL the models to do this and they'll follow your prompt.
"Give me your answer and rate each part of it for certainty by percentage" or similar.
could you please tell me how it generates that certainty score?
You can just tell the agent to do exactly that
I've had various agents backed by various models ignore the shit out of various rules and requests, at varying rates, but they all do it.
When you point it out: "Oh yes, I did do that, which is contrary to the rules, request <whatever>... Anyway..."
Except you can't be sure it isn't producing nonsense when you do this, and generally the model(s) will be overconfident. This has been studied, see e.g. https://openreview.net/pdf?id=E6LOh5vz5x
>You can just tell the agent to do exactly that
You can.
It just won't do it.