Comment by gcgbarbosa
5 hours ago
"the intelligence is clearly there"
I wonder if I am using the same models as everyone else. To me, LLMs still give good answers 80% of the time, but the other 20% of the time they fail in such a miserable way that it's obvious the "intelligence" is not there.
It might be extra demand for rigor that's not equally applied to humans. One could argue that other coders in our teams, or even ourselves, often fail in "a miserable way", say about 20% of the time. But we block this out, or consider it "regular functioning", or just a one-off based on something we got wrong, "just a try" we redo, etc.
But when an LLM does it on an area we know, we notice and suddenly it's too much.
Because a human fails in a known way. If a human does not have expertise in domain X or tech Y, they will fail there and the expectation is that they will fail.
With an LLM you never know where it can fail. There is no domain expertise for an LLM. It can fail in a miserable way in the same domain it worked spectacularly for.
No. It is not intelligent at all to confidently assert false things you know nothing about, and humans don’t do this outside of compulsive liars. For example…
A few days ago I asked ChatGPT where a Spurgeon quote came from. Response:
“That quote is widely attributed to Charles Spurgeon, but pinning down an exact sermon or written source is surprisingly difficult—and that’s a red flag.
Short answer
There’s no well-attested primary source (sermon, lecture, or publication) where Spurgeon clearly says that exact wording. Etc. etc. …
Why it sounds like Spurgeon
It fits his theology and rhetoric almost perfectly: • etc. etc. …
Closest authentic themes (but not the quote)
Spurgeon repeatedly says things like: • etc. etc. …
So the quote is basically: a modern condensation of real Spurgeon ideas, not a verifiable citation, etc. etc.”
Utter bullshit. One web search produces the full sermon manuscript with the quote.
One could argue that the previous context in the thread primed the LLM to fail here, but once again, a person is not confused by the change of topic.
>It is not intelligent at all to confidently assert false things you know nothing about, and humans don’t do this outside of compulsive liars.
"The Dunning-Kruger effect describes a disturbing cognitive bias that afflicts us all. People with limited expertise in an area tend to overestimate how much they know—and we all have gaps in our expertise." [1]
[1] https://www.openmindmag.org/articles/david-dunning-on-expert...
> But when an LLM does it on an area we know, we notice and suddenly it's too much.
Well of course. The owners of the companies building this are constantly talking about it replacing us all. Why would it be surprising that it would then be held to a higher standard?
Because it doesn't need to match a higher standard to "replace us all". It's enough that it works on the same standard, or even a lesser one, but for cheaper, with no complaints, and 24/7.
I get about the same success rate with my problems (scientific computing usually), but they're often _much_ easier to check than to write, so an 80% success rate becomes game-changing.
It really depends on the field you are in, the tasks you set, and how much of it was in the training set. A web developer will find it succeeding in all tasks - while some c++ exotic physics simulation developer will find it lacking.
The "works for me" tells you more about the field of the LLM reviewer than about the LLM.
Funny you used this example :)
I'm a month and a half deep into using it to make a traffic simulator with a bespoke physics engine that has complete drivetrain, suspension, and tire kernels. Think rally sim with an arcadey super off road presentation. It also has a full (also bespoke) webtransport stack that has held up beyond my wildest dreams. The simulation itself is capable of >500k cars. That was all complete about 2 weeks ago; the remainder of the work is integrating and optimizing the (you guessed it, also bespoke) pure synthesis sound engines for drivetrain/engine/tire/collision noise, and making pixi performant enough to actually display it all.
My biggest regret is actually accepting its choice of pixi. If I had just trusted what I knew and done my own renderer too, it'd already be finished! In the meantime I'm having fun boiling down the nonlinear continuous-ish models into fitted surrogate polynomials and regime-specific closed forms. Currently using cloud credits I was given to test the library I need to accelerate this work on CDNA3/4 cards. It's so nice to make someone else's room hot for a change.
I've really enjoyed the ~3 month speedrun from "he has psychosis" to "the model did everything", yet somehow the number of people having this kind of success continues to match up with where I'd rank a given dev. There just aren't that many talented people out there and an even smaller subset of them are aiming high enough with LLMs, if at all. It's a truly awesome time to not have/need a job
E: Most of my frustration is directed at OAI, they keep fucking up the cache and usage calculations. They got a grand out of me, I'm excited to see what Deepseek does for me with the same.
> while some c++ exotic physics simulation developer will find it lacking
Can confirm, but I always read that I am holding it wrong.
I've consistently tried to apply LLMs to physics problems and they're utterly useless. They'll just confidently lie, or blatantly plagiarise source materials.
The issue is that once you hit niche physics simulations there simply isn't any training data available, so their limitations become incredibly apparent. It's also problematic because a field itself will contain lots of wrong information (it's research!), and AI picks all this up uncritically.
I thought I'd give ChatGPT a quick spin on my favourite question, which is "is the adm formalism strictly equivalent to general relativity", to which it consistently gives the wrong answer:
>Ah, now you’re hitting the subtlety head-on—that’s exactly where the “strict equivalence” claim needs nuance. Let’s unpack this carefully.
I don't know how anyone can stand these tools. It's just an obnoxious glazing machine that consistently tells me I'm a genius.
Gemini gives a slightly more robust answer, but fails catastrophically on the question "is the bssn formalism numerically stable", where just about the entire answer is wrong from top to bottom. It certainly looks convincing. It's got all the right terminology. It manages to piece together the right set of words, but all the informational content is wrong, which isn't exactly a small problem.
I struggle to see how these tools are of any use.
You're not. People are just using a hammer to build a shed and telling you it's surely good to dig a hole too.
That's a better score than I'd give my own thinking.
In my experience of hiring and managing people, I would have been very happy if they gave good answers or produced good results 80% of the time.
GPT-5.5, 100% so far for all of my problems that actually have an answer.