Recent AI model progress feels mostly like bullshit

10 months ago (lesswrong.com)

477 comments

paulpauper

The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc performance on IMO questions. This massively suggests AI models simply remember the past results, instead of actually solving these questions. I'm incredibly surprised no one mentions this, but it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc) from train data.

AIPedant 10 months ago
Yes, here's the link: https://arxiv.org/abs/2503.21934v1
Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
- larodi 10 months ago
  
  This is a paper by INSAIT researchers - a very young institute which hired most of its PHD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waiving their BG-GPT on national TV in the country as a major breakthrough, while it was basically was a Mistral fine-tuned model, that was eventually never released to the public, nor the training set.
  Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on linkedin by these guys, literally celebrating every PHD they get, which is at minimum very weird. I'd take anything they release with a grain of sand if not caution.
  
  3 replies →
- apercu 10 months ago
  
  In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.
  
  4 replies →
- JohnKemeny 10 months ago
  
  Discussed here: https://news.ycombinator.com/item?id=43540985 (Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, 4 points, 2 comments).
- otabdeveloper4 10 months ago
  
  Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)
  This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)
  
  18 replies →
billforsternz 10 months ago
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages; 1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres] 2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres] 3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000] 4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]
So 1) 2) and 3) were out by 1,1 and 3 orders of magnitude respectively (the errors partially cancelled out) and 4) was nonsensical.
This little experiment made my skeptical about the state of the art of AI. I have seen much AI output which is extraordinary it's funny how one serious fail can impact my point of view so dramatically.
- aezart 10 months ago
  
  > I have seen much AI output which is extraordinary it's funny how one serious fail can impact my point of view so dramatically.
  I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
  
  10 replies →
- aoeusnth1 10 months ago
  
  2.5 pro nails each of these calculations. I don’t agree with Google’s decision to use a weak model in its search queries, but you can’t say progress on LLMs in bullshit as evidenced by a weak model no one thinks is close to SOTA.
- Sunspark 10 months ago
  
  It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never have written or translated the item in question, especially if they passed away before the piece was written.
  The AI will create something for you and tell you it was them.
  
  1 reply →
- CivBase 10 months ago
  
  I just asked my company-approved AI chatbot the same question.
  It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.
  It's final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
  It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowlesge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.
  When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.
  Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).
  I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.
  Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if might not matter in the context of filling an air plane cabin with golf balls, it does not inspire trust for more serious questions.
  
  3 replies →
- senordevnyc 10 months ago
  
  Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...
  
  1 reply →
- greenmartian 10 months ago
  
  Weird thing is, in Google AI Studio all their models—from the state-of-the-art Gemini 2.5Pro, to the lightweight Gemma 2—gave a roughly correct answer. Most even recognised the packing efficiency of spheres.
  But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
  
  8 replies →
- swader999 10 months ago
  
  It'll get it right next time because they'll hoover up the parent post.
- raxxorraxor 10 months ago
  
  This reminds me of Google quick answers we had for a time in search. It is quite funny if you live outside the US, because it very often got the units or numbers wrong because of different decimal delimiters.
  No wonder Trump isn't afraid to put taxes against Canada. Who could take a 3.8 sqare miles country seriously?
- throwawaymaths 10 months ago
  
  I've seen humans make exactly these sorts of mistakes?
  
  1 reply →
- tim333 10 months ago
  
  A lot of humans are similarly good at some stuff and bad at other things.
  Looking up the math ability of the average American this is given as an example for the median (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):
  >Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
  Which is ok but easier than golf balls in a 747 and hugely easier than USAMO.
  Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
simonw 10 months ago

I had to look up these acronyms:
- USAMO - United States of America Mathematical Olympiad
- IMO - International Mathematical Olympiad
- ICPC - International Collegiate Programming Contest
Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.
sanxiyn 10 months ago
Nope, no LLMs reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not a LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of LLM with lots of bells and whistles, just like AlphaFold is.
- InkCanon 10 months ago
  
  o1 reportedly got 83% on IMO, and 89th percentile on Codeforces.
  https://openai.com/index/learning-to-reason-with-llms/
  The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.
  
  2 replies →
bglazer 10 months ago
Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise its a fun mathematical puzzle and the biology is pretty wild too
- root_axis 10 months ago
  
  It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.
  Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
  Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
  
  2 replies →
- MoonGhost 10 months ago
  
  I was working some time ago on image processing model using GAN architecture. One model produces output and tries to fool the second. Both are trained together. Simple, but requires a lot extra efforts to make it work. Unstable and falls apart (blows up to unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing models' architectures and sizes. Adjusting some coefficients through the training to gradually rebalance loss functions' influence.
  The same may work with you problem. If it's unstable try introduce extra 'brakes' which theoretically are not required. May be even incorrect. Whatever it is in your domain. Another thing to check is optimizer, try several. Check default parameters. I've heard Adams defaults lead to instability later in training.
  PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
  PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
  
  2 replies →
- torginus 10 months ago
  
  When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.
  Math packages of the time like Mathematica and MATLAB helped me immensely, once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations, integrate tricky functions, even though AI was nowhere to be found back then.
  I feel like ChatGPT is doing something similar when doing maths with its chain of thoughts method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
- airstrike 10 months ago
  
  I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model
  
  4 replies →
- melagonster 10 months ago
  
  I doubt this is because his explanation is better. I tried to ask question of Calculus I, ChatGPT just repeated content from textbooks. It is useful, but people should remind that where the limitation is.
- kristianp 10 months ago
  
  Have you tried gemini 2.5? It's one of the best reasoning models. Available free in google ai studio.
sigmoid10 10 months ago
>I'm incredibly surprised no one mentions this
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
- Workaccount2 10 months ago
  
  On top of that, what the model prints out in the CoT window is not necessarily what the model is actually thinking. Anthropic just showed this in their paper from last week where they got models to cheat at a question by "accidentally" slipping them the answer, and the CoT had no mention of answer being slipped to them.
usaar333 10 months ago
And then within a week, Gemini 2.5 was tested and got 25%. Point is AI is getting stronger.
And this only suggested LLMs aren't trained well to write formal math proofs, which is true.
- selcuka 10 months ago
  
  > within a week
  How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
  
  15 replies →
- MoonGhost 10 months ago
  
  They are trained on some mix with minimal fraction of math. That's how it was from the beginning. But we can rebalance it by adding quality generated content. Just content will cost millions of $$ to generate. Distillation on new level looks like logical next step.
KolibriFly 10 months ago

Yeah, this is one of those red flags that keeps getting hand-waved away, but really shouldn't be.
yahoozoo 10 months ago

LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.
utopcell 10 months ago

This is simply using LLMs directly. Google has demonstrated that this is not the way to go when it comes to solving math problems. AlphaProof, which used AlphaZero code, got a silver medal in last year's IMO. It also didn't use any human proofs(!), only theorem statements in lean, without their corresponding proofs [1].
[1] https://www.youtube.com/watch?v=zzXyPGEtseI
geuis 10 months ago
Query: Could you explain the terminology to people who don't follow this that closely?
- BlanketLogic 10 months ago
  
  Not the OP but
  USAMO : USA Math Olympiad. Referred here https://arxiv.org/pdf/2503.21934v1
  IMO : International Math Olympiad
  SOTA : State of the Art
  OP is probably referring to this referred to this paper here https://arxiv.org/pdf/2503.21934v1. The paper explains out how a rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped about).
cma 10 months ago

OpenAI told how they removed it for GPT-4 in its release paper: only exact string matches. So all discussion of bar exam questions from memory on test taking forums etc., that wouldnn't exactly match, made it in.
AstroBen 10 months ago
This seems fairly obvious at this point. If they were actually reasoning at all they'd be capable (even if not good) of complex games like chess
Instead they're barely able to eek out wins against a bot that plays completely random moves: https://maxim-saplin.github.io/llm_chess/
- gilleain 10 months ago
  
  Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)
  
  2 replies →
- kylebyte 10 months ago
  
  Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.
  
  1 reply →
- famouswaffles 10 months ago
  
  LLMs are capable of playing chess and 3.5 turbo instruct does so quite well (for a human) at 1800 ELO. Does this mean they can truly reason now ?
  https://github.com/adamkarvonen/chess_gpt_eval
  
  18 replies →
- raylad 10 months ago
  
  Eek! You mean eke.
SergeAx 10 months ago

Because of the vast number of problems reused, removing those data from training sets will just make models worse. Why would anyone do it?
anonzzzies 10 months ago

That type of news might make investors worry / scared.
hyperbovine 10 months ago
Is that really so surprising given what we know about how these models actually work? I feel vindicated on behalf of myself and all the other commenters who have been mercilessly downvoted over the past three years for pointing out the obvious fact that next token prediction != reasoning.
- aoeusnth1 10 months ago
  
  2.5 pro scores 25%.
  It’s just a much harder math benchmark which will fall by the end of next year just like all the others. You won’t be vindicated.
  
  3 replies →
colonial 10 months ago

Less than 5%. OpenAI's O1 burned through over $100 in tokens during the test as well!
TrackerFF 10 months ago
What would the average human score be?
I.e. if you randomly sampled N humans to take those tests.
- sanxiyn 10 months ago
  
  The average human score on USAMO (let alone IMO) is zero, of course. Source: I won medals at Korean Mathematical Olympiad.
  
  4 replies →

iambateman 10 months ago

The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.

When you ask it a question, it tends to say yes.

So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.

The real challenge is that the LLM’s fundamentally want to seem agreeable, and that’s not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts which are more complicated than just a ChatGPT scenario.

I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.

I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.

bluefirebrand 10 months ago
> The real challenge is that the LLM’s fundamentally want to seem agreeable, and that’s not improving
LLMs fundamentally do not want to seem anything
But the companies that are training them and making models available for professional use sure want them to seem agreeable
- JohnKemeny 10 months ago
  
  > LLMs fundamentally do not want to seem anything
  You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.
  
  2 replies →
- mrweasel 10 months ago
  
  That sound reasonable to me, but the those companies forget that there's different types of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but not stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now".
  I've dropped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems are often down to design errors or thinking about a problem in a wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.
  
  2 replies →
- Terr_ 10 months ago
  
  Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tend to be either much shorter or a prelude to a flamewar.
boesboes 10 months ago
This rings true. What I notice is that the longer i let Claude work on some code for instance, the more bullshit it invents. I usually can delete about 50-60% of the code & tests it came up with.
And when you ask it to 'just write a test' 50/50 it will try to run it, fail on some trivial issues, delete 90% of your test code and start to loop deeper and deeper into the rabit hole of it's own halliciations.
Or maybe I just suck at prompting hehe
- namaria 10 months ago
  
  > Or maybe I just suck at prompting hehe
  Every time someone argues for the utility of LLMs in software development by saying you need to be better at prompting, or add more rules for the LLM on the repository, they are making an argument against using NLP in software development.
  The whole point of code is that it is a way to be very specific and exact and to exercise control over the computer behavior. The entire value proposition of using an LLM is that it is easier because you don't need to be so specific and exact. If then you say you need to be more specific and exact with the prompting, you are slowly getting at the fact that using NLP for coding is a bad idea.
tristor 10 months ago

It's, in many ways, the same problem as having too many "yes men" on a team at work or in your middle management layer. You end up getting wishy-washy, half-assed "yes" answers to questions that everyone would have been better off if they'd been answered as "no" or "yes, with caveats" with predictable results.
In fact, this might be why so many business executives are enamored with LLMS/GenAI: It's a yes-man they don't even have to employ, and because they're not domain experts, as per usual, they can't tell that they're being fed a line of bullshit.

signa11 10 months ago

> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.

umm, it seems to me that it is this (tfa):

     But I would nevertheless like to submit, based off of internal
     benchmarks, and my own and colleagues' perceptions using these models,
     that whatever gains these companies are reporting to the public, they
     are not reflective of economic usefulness or generality.

and then couple of lines down from the above statement, we have this:

     So maybe there's no mystery: The AI lab companies are lying, and when
     they improve benchmark results it's because they have seen the answers
     before and are writing them down.

signa11 10 months ago

[this went way outside the edit-window and hence a separate comment] imho, state of varying experience with llm's can aptly summed in this poem by Mr. Longfellow

     There was a little girl,
        Who had a little curl,
     Right in the middle of her forehead.
        When she was good,
        She was very good indeed,
     But when she was bad she was horrid.

malingo 10 months ago

"when you ask him anything, he never answers 'no' -- he just yesses you to death and then he takes your dough"

lukev 10 months ago

This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.

I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.

But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.

freehorse 10 months ago
There is nothing wrong with sharing anecdotal experiences. Reading through anecdotal experiences here can help understand how one's own experience are relatable or not. Moreover, if I have X experience it could help to know if it is because of me doing sth wrong that others have figured out.
Furthermore, as we are talking about actual impact of LLMs, as is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks to figure it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people do something wrong if they do not see the same impact as others. Everytime a web developer says that they do not understand how others may be so skeptical of LLMs, conclude with certainty that they must be doing sth wrong and move on to explain how to actually use LLMs properly, I chuckle.
- otterley 10 months ago
  
  Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.
  
  4 replies →
FiniteIntegral 10 months ago
It's not surprising that responses are anecdotal. An easy way to communicate a generic sentiment often requires being brief.
A majority of what makes a "better AI" can be condensed to how effective the slope-gradient algorithms are at getting the local maxima we want it to get to. Until a generative model shows actual progress of "making decisions" it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something that is on the level of human decision making.
- code_biologist 10 months ago
  
  At risk of being annoying, answers that feel like high quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six fingered hands because they think it's more pleasing, they're doing it because they're trying to please and not good enough yet.
  I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.
  I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.
lherron 10 months ago
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal for some time to come.
I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.
I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.
- InkCanon 10 months ago
  
  Frankly the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claiming silver/gold in IMOs. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC AGI questions and train it on that.
dsign 10 months ago
You want to block subjectivity? Write some formulas.
There are three questions to consider:
a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.
b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.
c) And then the last question: will AI continue improving?
If taken as boolean values, (a), (b) and (c) have a truth table with eight values, the most interesting row being false, true, true: "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).
There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
- namaria 10 months ago
  
  If you're gonna formulate this conversation as a satisfiability problem you should be aware that this is an NP-complete problem (and actually working on that problem is the source of the insight that there is such as thing as NP-completeness).
- lukev 10 months ago
  
  Invalid expression: value of type "probability distribution" cannot be cast to type "boolean".
- pdimitar 10 months ago
  
  A lot of people judge by the lack of their desired outcome. Calling that fear and denial is disingenuous and unfair.
  
  1 reply →
KolibriFly 10 months ago

Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight
throwanem 10 months ago

> I'm really curious what features signal an ability to make "better choices" w/r/t AI
So am I. If you promise you'll tell me after you time travel to the future and find out, I'll promise you the same in return.
nialv7 10 months ago

Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.
aunty_helen 10 months ago

That’s a good point, the comments section is very anecdotal. Do you have any data to say if this is a common occurrence or specific to this topic?
ramesh31 10 months ago
>"This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven."
People having vastly different opinions on AI simply comes down to token usage. If you are using millions of tokens on a regular basis, you completely understand the revolutionary point we are at. If you are just chatting back and forth a bit with something here and there, you'll never see it.
- lukev 10 months ago
  
  So this is interesting because it's anecdotal (I presume you're a high-token user who believes it's revolutionary), but it's actually a measurable, falsifiable hypothesis in principle.
  I'd love to see a survey from a major LLM API provider that correlated LLM spend (and/or tokens) with optimism for future transformativity. Correlation with a view of "current utility" would be a tautology, obviously.
  I actually have the opposite intuition from you: I suspect the people using the most tokens are using it for very well-defined tasks that it's good at _now_ (entity extraction, classification, etc) and have an uncorrelated position on future potential. Full disclosure, I'm in that camp.
  
  1 reply →
- antonvs 10 months ago
  
  It's a tool and like all tools, it's sensitive to how you use it, and it's better for some purposes than others.
  Someone who lacks experience, skill, training, or even the ability to evaluate results may try to use a tool and blame the tool when it doesn't give good results.
  That said, the hype around LLMs certainly overstates their capabilities.

wg0 10 months ago

Unlike many - I find author's complaints on the dot.

Once all the AI batch startups have sold subscriptions to the cohort and there's no more further market growth because businesses outside don't want to roll the dice on a probabilistic model that doesn't have an understanding of pretty much anything rather is a clever imitation machine on the content it has seen, the AI bubble will burst when more statups would start packing up by end of 2026 or max 2027.

consumer451 10 months ago
I would go even further than TFA. In my personal experience using Windsurf daily, Sonnet 3.5 is still my preferred model. 3.7 makes many more changes that I did not ask for, often breaking things. This is an issue with many models, but it got worse with 3.7.
- cootsnuck 10 months ago
  
  Yea, I've experienced this too with 3.7. Not always though. It has been helpful for me more often than not helpful. But yea 3.5 "felt" better to me.
  Part of me thinks this is because I expected less of 3.5 and therefore interacted with it differently.
  It's funny because it's unlikely that everyone interacts with these models in the same way. And that's pretty much guaranteed to give different results.
  Would be interesting to see some methods come out for individuals to measure their own personal success rate/ productivity / whatever with these different models. And then have a way for people to compare them with each other so we can figure out who is working well with these models and who isn't and figure out why the difference.
  
  1 reply →
- behnamoh 10 months ago
  
  3.7 is like a wild horse. you really must ground it with clear instructions. it sucks that it doesn't automatically know that but it's tameable.
  
  11 replies →
- Zetaphor 10 months ago
  
  I finally gave up on 3.7 in Cursor after three rounds of it completely ignoring what I asked it for so that it could instead solve an irrelevant linter error. The error in no way affected functionality.
  Despite me rejecting the changes and explicitly telling it to ignore the linter it kept insisting on only trying to solve for that

jonahx 10 months ago

My personal experience is right in line with the author's.

Also:

> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.

ants_everywhere 10 months ago

There are real and obvious improvements in the past few model updates and I'm not sure what the disconnect there is.

Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.

But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.

Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.

Or perhaps they're giving the LLMs a poorly defined task where older models made assumptions about but newer models understand the ambiguity of and so find the space of solutions harder to navigate.

Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).

pclmulqdq 10 months ago
In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while actually describing the problem precisely has gotten better in terms of producing results.
In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.
- ants_everywhere 10 months ago
  
  yes that would explain the effect I think. I'll try that out this week.
DebtDeflation 10 months ago

The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental. Yes we got multimodal but that was part of GPT-4, they just didn't release it initially, and up until very recently it mostly handed off to another model. Yes we got reasoning models, but people had been using CoT for awhile so it was just a matter of time before RL got used to train it into models. Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router model that picks the best existing model to hand a prompt off to.
stafferxrr 10 months ago

It is like how I am not impressed by the models when it comes to progress with chemistry knowledge.
Why? Because I know so little about chemistry myself that I wouldn't even know what to start asking the model as to be impressed by the answer.
For the model to be useful at all, I would have to learn basic chemistry myself.
Many though I suspect are in this same situation with all subjects. They really don't know much of anything and are therefore unimpressed by the models response in the same way I am not impressed with chemistry responses.
AIPedant 10 months ago

[dead]

HarHarVeryFunny 10 months ago

The disconnect between improved benchmark results and lack of improvement on real world tasks doesn't have to imply cheating - it's just a reflection of the nature of LLMs, which at the end of the day are just prediction systems - these are language models, not cognitive architectures built for generality.

Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?

If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, they why would you expect it to improve programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help hardening corporate production systems?!

InkCanon 10 months ago
That's fair. But look up the recent experiment on SOTA models on the then just released USAMO 2025 questions. Highest score was 5%, supposedly SOTA last year was IMO silver level. There could be some methodological differences - ie USAMO paper required correct proofs and not just numerical answers. But it really strongly suggests even within limited domains, it's cheating. I'd wager a significant amount that if you tested SOTA models on a new ICPC set of questions, actual performance would be far, far worse than their supposed benchmarks.
- usaar333 10 months ago
  
  > Highest score was 5%, supposedly SOTA last year was IMO silver level.
  No LLM last year got silver. Deepmind had a highly specialized AI system earning that
throwawayffffas 10 months ago

In my view as well it's not really cheating, it's just over fitting.
If a model doesn't do good in the benchmarks it will either be retrained until it does or you won't hear about it.
KolibriFly 10 months ago

Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench

joelthelion 10 months ago

I've used gemini 2.5 this weekend with aider and it was frighteningly good.

It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.

jchw 10 months ago

I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.
heresie-dabord 10 months ago

> It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.
Even approximations must be right to be meaningful. If information is wrong, it's rubbish.
Presorting/labelling various data has value. Humans have done the real work there.
What is "leading" us at present are the exaggerated valuations of corporations. You/we are in a bubble, working to justify the bubble.
Until a tool is reliable, it is not installed where people can get hurt. Unless we have revised our concern for people.
mountainriver 10 months ago
Yep, and what they are going in cursor either the agentic stuff is really game changing.
People who can’t recognize this intentionally have their heads in the sand
- InkCanon 10 months ago
  
  People are really fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". There's a careful line between both.
  1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, which is more of user experience. In absolute utility, it is likely git is the single most game changing piece of software there is. It is likely git has saved some ten, maybe eleven digit number in engineer hours times salary in how it enables massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer. For example, recently I tried to use cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious. A lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. It's quite clear if you have masses of NextJS code, which disproportionately is older versions, but none labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) and don't require much principled understanding (like understanding how NextJS 15 would break so and so's auth). Another example of this is when I tried to use it to generate flags for a V8 build, it failed horribly and would simply hallucinate flags all the time. This seemed very likely to be (despite the existence of a list of V8 flags online) that many flags had very close representations in vector embeddings, and that there was almost close to zero data/detailed examples on their use.
  2) In the more theoretical side, the performance of LLMs on benchmarks (claiming to be elite IMO solvers, competitive programming solvers) have become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA when was at least a silver IMO. This is against the backdrop of exponential compute and data being fed in. Combined with apparently diminishing returns, this suggests that the gains from that are running really thin.
- dimitri-vs 10 months ago
  
  I guess you haven't been on /r/cursor or forum.cursor.com lately?
  "game changing" isn't exactly the sentiment there the last couple months.

throw310822 10 months ago

I hope it's true. Even if LLMs development stopped now, we would still keep finding new uses for them at least for the next ten years. The technology is evolving way faster than we can meaningfully absorb it and I am genuinely frightened by the consequences. So I hope we're hitting some point of diminishing returns, although I don't believe it a bit.

fxtentacle 10 months ago

I'd say most of the recent AI model progress has been on price.

A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour. (from $12+ for models requiring 8x H100)

xiphias2 10 months ago
Have you compared it with 8-bit QwQ-17B?
In my evals 8 bit quantized smaller Qwen models were better, but again evaluating is hard.
- redrove 10 months ago
  
  There’s no QwQ 17B that I’m aware of. Do you have a HF link?
  
  2 replies →
shostack 10 months ago
Yeah, I'm thinking of this from a Wardley map standpoint.
What innovation opens up when AI gets sufficiently commoditized?
- bredren 10 months ago
  
  One thing I’ve seen is large enterprises extracting money from consumers by putting administrative burden on them.
  For example, you can see this in health insurance reimbursements and wireless carriers plan changes. (ie, Verizon’s shift from Do More, etc to what they have now)
  Companies basically set up circumstances where consumers lose small amounts of money on a recurring basis or sporadically enough that the people will just pay the money rather than a maze of calls, website navigation and time suck to recover funds due to them or that shouldn’t have been taken in the first place.
  I’m hopeful well commoditized AI will give consumers a fighting chance at this and other types of disenfranchisement that seems to be increasingly normalized by companies that have consultants that do nothing but optimize for their own financial position.
- mentalgear 10 months ago
  
  Brute force, brute force everything at least for the domains you can have automatic verification in.

maccard 10 months ago

My experience as someone who uses LLMs and a coding assist plugin (sometimes), but is somewhat bearish on AI is that GPT/Claude and friends have gotten worse in the last 12 months or so, and local LLMs have gone from useless to borderline functional but still not really usable for day to day.

Personally, I think the models are “good enough” that we need to start seeing the improvements in tooling and applications that come with them now. I think MCP is a good step in the right direction, but I’m sceptical on the whole thing (and have been since the beginning, despite being a user of the tech).

sksxihve 10 months ago
The whole MCP hype really shows how much of AI is bullshit. These LLMs have consumed more API documentation than possible for a single human and still need software engineers to write glue layers so they can use the APIs.
- maccard 10 months ago
  
  I don't think I agree, entirely.
  The problem is that up until _very_ recently, it's been possible to get LLMs to generate interesting and exciting results (as a result of all the API documentation and codebases they've inhaled), but it's been very hard to make that usable. I think we need to be able to control the output format of the LLMs in a better way before we can work on what's in the output. I don't konw if MCP is the actual solution to that, but it's certainly an attempt at it...
  
  2 replies →
- machiaweliczny 10 months ago
  
  Because it’s lossy compression. I also consumed a lot of books and even more movies and I don’t have good memory of it all. But some core facts and intuition from it.
  
  2 replies →

einrealist 10 months ago

LeCun criticized LLM technology recently in a presentation: https://www.youtube.com/watch?v=ETZfkkv6V7Y

The accuracy problem won't just go away. Increasing accuracy is only getting more expensive. This sets the limits for useful applications. And casual users might not even care and use LLMs anyway, without reasonable result verification. I fear a future where overall quality is reduced. Not sure how many people / companies would accept that. And AI companies are getting too big to fail. Apparently, the US administration does not seem to care when they use LLMs to define tariff policy....

pclmulqdq 10 months ago
I don't know why anyone is surprised that a statistical model isn't getting 100% accuracy. The fact that statistical models of text are good enough to do anything should be shocking.
- whilenot-dev 10 months ago
  
  I think the surprising aspect is rather how people are praising 80-90% accuracy as the next leap in technological advancement. Quality is already in decline, despite LLMs, and programming was always a discipline where correctness and predictability mattered. It's an advancement for efficiency, sure, but on the yet unknown cost of stability. I'm thinking about all simulations based on applied mathematical concepts and all the accumulated hours fixing bugs - there's now this certain aftertaste, sweet for some living their lives efficiently, but very bitter for the ones relying on stability.
  
  1 reply →
- einrealist 10 months ago
  
  That "good enough" is the problem. It requires context. And AI companies are selling us that "good enough" with questionable proof. And they are selling grandiose visions to investors, but move the goal post again and again.
  A lot of companies made Copilot available to their workforce. I doubt that the majority of users understand what a statistical model means. The casual, technically inexperienced user just assumes that a computer answer is always right.

sema4hacker 10 months ago

> ...whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.

I'm not surprised, because I don't expect pattern matching systems to grow into something more general and useful. I think LLM's are essentially running into the same limitations that the "expert systems" of the 1980's ran into.

gundmc 10 months ago

This was published the day before Gemini 2.5 was released. I'd be interested if they see any difference with that model. Anecdotally, that is the first model that really made me go wow and made a big difference for my productivity.

usaar333 10 months ago

Ya, I find this hard to imagine aging well. Gemini 2.5 solved (at least much better than) multiple real world systems questions I've had in the past that other models could not. Its visual reasoning also jumped significantly on charts (e.g. planning around train schedules)
Even Sonnet 3.7 was able to do refactoring work on my codebase sonnet 3.6 could not.
Really not seeing the "LLMs not improving" story
jonahx 10 months ago
I doubt it. It still flails miserably like the other models on anything remotely hard, even with plenty of human coaxing. For example, try to get it to solve: https://www.janestreet.com/puzzles/hall-of-mirrors-3-index/
- flutas 10 months ago
  
  FWIW 2.5-exp was the only one that managed to get a problem I asked it right, compared to Claude 3.7 and o1 (or any of the other free models in Cursor).
  It was reverse engineering ~550MB of Hermes bytecode from a react native app, with each function split into a separate file for grep-ability and LLM compatibility.
  The others would all start off right then quickly default to just greping randomly what they expected it to be, which failed quickly. 2.5 traced the function all the way back to the networking call and provided the expected response payload.
  All the others hallucinated the networking response I was trying to figure out. 2.5 Provided it exactly enough for me to intercept the request and using the response it provided to get what I wanted to show up.
  
  2 replies →
- Xenoamorphous 10 months ago
  
  I’d say the average person wouldn’t understand that problem, let alone solve it.
georgemcbay 10 months ago
As someone who was wildly disappointed with the hype around Claude 3.7, Gemini 2.5 is easily the best programmer-assistant LLM available, IMO.
But it still feels more like a small incremental improvement rather than a radical change, and I still feel its limitations constantly.
Like... it gives me the sort of decent but uninspired solution I would expect it to generate without predictably walking me through a bunch of obvious wrong turns as I repeatedly correct it as I would have to have done with earlier models.
And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.
- dimitri-vs 10 months ago
  
  This is exactly my sentiment. Sonnet-3.5-latest was the perfect code companion: wrote just the right amount of okay quality code but its strength was it really tried to adhere to your instructions. sonnet-3.7 was the exact opposite, wrote waaay too much code and overengineered things like crazy while having very poor instruction adherence. Gemini 2.5 Pro is basically what I hoped sonnet-3.7 would be: follows instructions well but still softly opinionated, massive (usable) context window, fast response, more biased towards latest best practices and a up to date knowledge cutoff.
  I'm wondering how much gemini 2.5 being "amazing" comes from sonnet-3.7 being such a disappointment.
ponorin 10 months ago

There's somehow this belief that "newer models will disprove <insert LLM criticism here>" despite the "newer" models being... just a scaled-up version of a previous model, or some anciliary features tacked on. An LLM is an LLM is an LLM: I'll believe it when I see otherwise.

nialv7 10 months ago

Sounds like someone drank their own Kool aid (believing current AI can be a security researcher), and then gets frustrated when they realize they have overhyped themselves.

Current AI just cannot do the kind of symbolic reasoning required for finding security vulnerabilities in softwares. They might have learned to recognize "bad code" via pattern matching, but that's basically it.

aerhardt 10 months ago

My mom told me yesterday that Paul Newman had massive problems with alcohol. I was somewhat skeptical, so this morning I asked ChatGPT a very simple question:

"Is Paul Newman known for having had problems with alcohol?"

All of the models up to o3-mini-high told me he had no known problems. Here's o3-mini-high's response:

"Paul Newman is not widely known for having had problems with alcohol. While he portrayed characters who sometimes dealt with personal struggles on screen, his personal life and public image were more focused on his celebrated acting career, philanthropic work, and passion for auto racing rather than any issues with alcohol. There is no substantial or widely reported evidence in reputable biographies or interviews that indicates he struggled with alcohol abuse."

There is plenty of evidence online that he struggled a lot with alcohol, including testimony from his long-time wife Joanne Woodward.

I sent my mom the ChatGPT reply and in five minutes she found an authoritative source to back her argument [1].

I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.

Lesson(s) learned... Including not doubting my mother's movie trivia knowledge.

[1] https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...

fnordpiglet 10 months ago
This is less an LLM thing than an information retrieval question. If you choose a model and tell it to “Search,” you find citation based analysis that discusses that he indeed had problems with alcohol. I do find it interesting it quibbles whether he was an alcoholic or not - it seems pretty clear from the rest that he was - but regardless. This is indicative of something crucial when placing LLMs into a toolkit. They are not omniscient nor are they deductive reasoning tools. Information retrieval systems are excellent at information retrieval and should be used for information retrieval. Solvers are excellent at solving deductive problems. Use them. The better they get at these tasks alone is cool but is IMO a parlor trick since we have nearly optimal or actually optimal techniques that don’t need an LLM. The LLM should use those tools. So, click search next time you have an information retrieval question. https://chatgpt.com/share/67f2dac0-3478-8000-9055-2ae5347037...
- Vanit 10 months ago
  
  I realise your answer wasn't assertive, but if I heard this from someone actively defending AI it would be a copout. If the selling point is that you can ask these AIs anything then one can't retroactively go "oh but not that" when a particular query doesn't pan out.
  
  3 replies →
- terminatornet 10 months ago
  
  lotta words here to say AI can't do basic search right
  
  2 replies →
- mvdtnz 10 months ago
  
  Any information found in a web search about Newman will be available in the training set (more or less). It's almost certainly a problem of alignment / "safety" causing this issue.
  
  4 replies →
r_klancer 10 months ago
Gemini (2.5 Pro):
"Yes, Paul Newman was widely known for being a heavy drinker, particularly of beer. He himself acknowledged his significant alcohol consumption."
The answer I got (https://gemini.google.com/share/9e327dc4be03) includes references such as https://apnews.com/article/entertainment-reviews-movies-paul... and https://www.psychologytoday.com/us/blog/the-playing-field/20... although they are redacted from the public-sharing link.
- r_klancer 10 months ago
  
  Though a local model I'm running (gemma-3-27b-it; https://huggingface.co/lmstudio-community/gemma-3-27b-it-GGU...) just told me various correct sounding bits about his history with alcohol (correctly citing his alma mater and first wife), but threw in:
  "Sobriety & AA: Newman got sober in 1964 and remained so for the rest of his life."
  Which doesn't check out. And it includes plausible but completely hallucinated URLs (as well as a valid biography.com URL that completely omits information about alcohol.)
  
  1 reply →
- tim333 10 months ago
  
  Perplexity:
  >Paul Newman is indeed known for having struggled with alcohol during his life. Accounts from various sources, including his own memoir and the documentary ... (https://www.perplexity.ai/search/is-paul-newman-known-for-ha...)
  I guess there's something about ChatGPT's set up that makes it different? Maybe they wanted it to avoid libeling people?
  
  1 reply →
blitzar 10 months ago
Does the as yet unwritten prequel of Idiocracy tell the tale of when we started asking Ai chat bots for facts and this was the point of no return for humanity?
- MarcelOlsz 10 months ago
  
  The chatgpt'd tariffs marks this epoch perfectly.
- pclmulqdq 10 months ago
  
  It turns out there's huge demand for un-monetized web search.
  
  5 replies →
- aerhardt 10 months ago
  
  Can you blame the users for asking it, when everyone is selling that as a key defining feature?
  I use it for asking - often very niche - questions on advanced probability and simulation modeling, and it often gets those right - why those and not a simple verifiable fact about one of the most popular actors in history?
  I don’t know about Idiocracy, but something that I have read specific warnings about is that people will often blame the user for any of the tool’s misgivings.
- jmount 10 months ago
  
  Some prior works that work as prequels include C.M. Kornbluth's "The Marching Morons" and "The Little Black Bag."
qiqitori 10 months ago
"Is Paul Newman known for having had problems with alcohol?"
https://chatgpt.com/share/67f332e5-1548-8012-bd76-e18b3f8d52...
Your query indeed answers "...not widely known..."
"Did Paul Newman have problems with alcoholism?"
https://chatgpt.com/share/67f3329a-5118-8012-afd0-97cc4c9b72...
"Yes, Paul Newman was open about having struggled with alcoholism"
What's the issue? Perhaps Paul Newman isn't _famous_ ("known") for struggling with alcoholism. But he did struggle with alcoholism.
Your usage of "known for" isn't incorrect, but it's indeed slightly ambiguous.
- posterguy 10 months ago
  
  Counterpoint: Paul Newman was absolutely a famous drunk, as evidenced by this Wikipedia page.* Any query for "paul newman alcohol" online will return dozens of reputable sources on the topic. Your post is easily interpretable as handwaving apologetics, and it gives big "Its the children who are wrong" energy.
  *https://en.wikipedia.org/wiki/Newman_Day
  
  4 replies →
kayo_20211030 10 months ago

This may have hit the nail on the head about the weaknesses of LLM's.
They're going to regurgitate something not so much based on facts, but based on things that are accessible as perceived facts. Those might be right, but they might be wrong also; and no one can tell without doing the hard work of checking original sources. Many of what are considered accepted facts, and also accessible to LLM harvesting, are at best derived facts, often mediated by motivated individuals, and published to accessible sources by "people with an interest".
The weightings used by any AI should be based on the facts, and not the compounded volume of derived, "mediated", or "directed" facts - simply, because they're not really facts; they're reports.
It all seems like dumber, lazier search engine stuff. Honestly, what do I know about Paul Newman? But, Joanne Woodward and others who knew and worked with him should be weighted as being, at least, slightly more credible that others; no matter how many text patterns "catch the match" flow.
mlinhares 10 months ago
> I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
I think we'll have a term like we have for parents/grandparents that believe everything they see on the internet but specifically for people using LLMs.
- BobbyTables2 10 months ago
  
  Look at how many people believe in extremist news outlets!
- delian66 10 months ago
  
  What is that term?
lfkdev 10 months ago
Thats not really 'simple' for an LLM. This is a niche information about a specifc person, LLM's train on massive amount of data, the more a topic is being present in the data, the better will the answers be.
Also, you can/should use the "research" mode for questions like this.
- aerhardt 10 months ago
  
  The question is simple and verifiable - it is impressive to me that it’s not contained in the LLM’s body of knowledge - or rather that it can’t reach the answer.
  This is niche in the grand scheme of knowledge but Paul Newman is easily one of the biggest actors in history, and the LLM has been trained on a massive corpus that includes references to this.
  Where is the threshold for topics with enough presence in the data?
  
  2 replies →
drooby 10 months ago
I asked GPT-4.5 and it searched the web and immediately gave me a "yes" with paragraphs of sources cited.
- john2x 10 months ago
  
  Truth is a probability game. Just keep trying until you arrive.
  
  3 replies →
- boznz 10 months ago
  
  3-4 hours is enough time for It to have crawled the hacker news comments section. That's about the frequency the AI bots crawl my little out of the way blog.
jcims 10 months ago
I just had Cursor Pro + Sonnet 3.7 Max one shot a python script to send this question to every model available through groq.
>Found 24 models: llama3-70b-8192, llama-3.2-3b-preview, meta-llama/llama-4-scout-17b-16e-instruct, allam-2-7b, llama-guard-3-8b, qwen-qwq-32b, llama-3.2-1b-preview, playai-tts-arabic, deepseek-r1-distill-llama-70b, llama-3.1-8b-instant, llama3-8b-8192, qwen-2.5-coder-32b, distil-whisper-large-v3-en, qwen-2.5-32b, llama-3.2-90b-vision-preview, deepseek-r1-distill-qwen-32b, whisper-large-v3, llama-3.3-70b-specdec, llama-3.3-70b-versatile, playai-tts, whisper-large-v3-turbo, llama-3.2-11b-vision-preview, mistral-saba-24b, gemma2-9b-it
Excluding the ones that do not support chat completions, all but one (qwen-qwq-32b) answered in the affirmative. The answer from qwen-qwq-32b said:
Paul Newman, the renowned actor and humanitarian, did not have a widely publicized struggle with alcohol addiction throughout most of his life, but there were specific instances that indicated challenges.
Using lack of progress in a specialized field as a barometer for overall progress is kind of silly. I just spent the last few days 'vibe coding' an application and I have to say that it's pretty remarkable how capable it is now relative to my experience last year.
It took three minutes for me to do the above from the time I created my API key to when I had an answer.
- mnky9800n 10 months ago
  
  I find that everyone who replies with examples like this is an expert using expert skills to get the LLM to perform. Which makes me think why is this a skill that is useful to general public as opposed to another useful skill for technical knowledge workers to add to their tool belt?
  
  1 reply →
meta_ai_x 10 months ago

Gemini 2.5 Pro
Yes, Paul Newman was known for being a heavy drinker, particularly of beer. 1 He acknowledged his high consumption levels himself. 1. Review: Paul Newman memoir stuns with brutal honesty - AP News
apnews.com
While he maintained an incredibly successful career and public life, accounts and biographies note his significant alcohol intake, often describing it as a functional habit rather than debilitating alcoholism, although the distinction can be debated. He reportedly cut back significantly in his later years.
tptacek 10 months ago
I just asked 4o:
Yes, Paul Newman did experience significant struggles with alcohol. In his posthumously published memoir, The Extraordinary Life of an Ordinary Man, Newman candidly discusses his drinking habits and acknowledges his long-term battle with alcoholism. He describes himself as a "functioning alcoholic," a trait he noted was shared with his father. At one point, Newman was reported to consume a case of beer daily, followed by spirits, until he eventually gave up hard liquor.
- goldfeld 10 months ago
  
  My own markov chains follow up "until he" with fundamentally distinct follow-ups.
Alive-in-2025 10 months ago

These models are not reliable sources of information. They are either out of date, subject to hallucination, or just plain wrong for a variety of reasons. They are untrustworthy to ask facts like this.
I appreciate your consideration of a subjective question and how you explained it and understand these nuances. But please - do not trust chatgpt etc. I continue to be frustrated at the endless people claiming something is true from chatgpt. I support the conclusions of this author.
permo-w 10 months ago
this seems less like an issue with accuracy and more like an issue with model providers making sure they don't get sued for libel
- aerhardt 10 months ago
  
  I thought about that too.
d_burfoot 10 months ago

For questions like this, you need to tell the bot to search the web.
> Hi, can you search the web and find out if Paul Newman has any alcohol problems?
> Paul Newman, the acclaimed actor and philanthropist, had a well-documented fondness for alcohol, particularly beer. He was known to consume significant quantities, with some accounts noting he would drink a case of beer in a day. This level of consumption led some to describe him as a functioning alcoholic. His drinking habits reportedly caused strain in his marriage to Joanne Woodward. In the late 1970s, an incident occurred where Woodward did not allow Newman back into their house because of his drinking. He promised to give up hard liquor as a compromise, but continued to drink beer, sometimes consuming a case during dinner.
fernly 10 months ago

Perplexity had no problem with the question, replying in part, "Paul Newman is known to have struggled with alcohol throughout his life, as revealed in various accounts, including his own memoir and documentaries about his life...."
adamiscool8 10 months ago
I don't think this reflects a flaw in LLM's at all. It is true that Paul Newman is not "known for" having alcohol problems. Paul Newman is known for acting, and salad dressing. Paul Newman has been "known to" have had alcohol problems, which formed as a question correctly one-shots accurate results for me across a few common LLMs. Someone like Betty Ford could be "known for" having alcohol problems. Paul Newman's alcohol problems are relatively specialized knowledge.
- xdavidliu 10 months ago
  
  good point, though a great LLM answer would DWIM here and explain both interpretations as you have done here, as opposed to (what presumably happened) focus on one and ignore the other.
- moregrist 10 months ago
  
  And yet the New Yorker article was the first thing that my Google search for “Paul Newman alcoholism” returned.
  Apparently it isn’t so specialized that a pretty obvious old fashioned web search on Google wouldn’t immediately return an authoritative source.
stavros 10 months ago
LLMs aren't good at being search engines, they're good at understanding things. Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
I guess the problem with LLMs is that they're too usable for their own good, so people don't realizing that they can't perfectly know all the trivia in the world, exactly the same as any human.
- MegaButts 10 months ago
  
  > LLMs aren't good at being search engines, they're good at understanding things.
  LLMs are literally fundamentally incapable of understanding things. They are stochastic parrots and you've been fooled.
  
  27 replies →
- nyarlathotep_ 10 months ago
  
  > I guess the problem with LLMs is that they're too usable for their own good, so people don't realizing that they can't perfectly know all the trivia in the world, exactly the same as any human.
  They're quite literally being sold as a replacement for human intellectual labor by people that have received uncountable sums of investment money towards that goal.
  The author of the post even says this:
  "These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
  Can't blame people "fact checking" something that's supposed to fill these shoes.
  People should be (far) more critical of LLMs given all of these style of bold claims, not less.
  Also, telling people they're "holding it wrong" when they interact with alleged "Ay Gee Eye" "superintelligence" really is a poor selling point, and no way to increase confidence in these offerings.
  These people and these companies don't get to make these claims that threaten the livelihood of millions of people, inflate a massive bubble, impact hiring decisions and everything else we've seen and then get excused cause "whoops you're not supposed to use it like that, dummy."
  Nah.
  
  1 reply →
- lm28469 10 months ago
  
  > Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
  Hm nope, now that the web if flooded by LLM generated content it's game over. I can't tell how many times I almost got fooled by recipes &co which seem legit at first but are utter non sense. And now we're feeding that garbage back to where it came from
  
  1 reply →
- more_corn 10 months ago
  
  Ironically though an LLM powered search engine (some word about being perplexed) is becoming way better than the undisputed king of traditional search engines (something oogle)
  
  7 replies →
pshirshov 10 months ago

Perplexity and open-webui+ollama in web search mode answer this question correctly.
ilrwbwrkhv 10 months ago
LLMs will never be good at specific knowledge unless specifically trained for with narrow "if else" statements.
Its good for broad general overview such as most popular categories of books in the world.
- Rebuff5007 10 months ago
  
  Really? Open-AI says PhD intelligence is just around the corner!
  
  6 replies →
jonomacd 10 months ago

Looks like you are using the wrong models
https://g.co/gemini/share/ffa5a7cd6f46
adelpozo 10 months ago

I can’t reproduce. Maybe others reported the error and someone adjusted the expected answer, I do not know enough about OpenAI operations to say for sure.
The reason this bothers me is that comments like this reinforce the believes of people that could otherwise find value in these tools.
But I think points like this would be better made in shared chats or screenshots, since we do not have something like a core dump or stacktrace to attach.
And while I am not saying OP did this, I have seen technically skilled engineers asserting/implying that llm/chatbots aren’t good or not useful to them look at their chat log that a multitude of topics that I am sure would impact the result of the query.
Yes. It can be an UX problem. Yes. It can be an algorithmc problem. But they are just tools that can be used wrong and not a perfect mechanical brain.
pbronez 10 months ago
Testing the query on Kagi
# Quick Answer
Yes, Paul Newman struggled with alcohol. His issues with alcohol were explored in the HBO Max documentary, The Last Movie Stars, and Shawn Levy's biography, Paul Newman: A Life. According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating.
(4x Source footnotes omitted for readability)
# Ki Multi-step Research Assistant
Paul Newman is known to have struggled with alcohol. According to his posthumous memoir, Newman candidly discussed his issues with drinking and self-doubt, describing himself as an alcoholic who was tormented by insecurities[^1][^2]. He reportedly drank a significant amount of beer daily and later moved on to stronger drinks like Scotch[^3][^4]. His drinking habits were a notable part of his life, and he was often identified by his beer drinking[^5][^6]. Despite these struggles, Newman was also recognized for his generosity and devotion to his family[^7].
[^1]: [Posthumous memoir reveals Paul Newman's struggle with alcohol](https://www.express.co.uk/entertainment/films/1688113/paul-n...) [^2]: [Review: Paul Newman memoir stuns with brutal honesty - AP News](https://apnews.com/article/entertainment-reviews-movies-paul...) [^3]: [Was Paul Newman Really an Alcoholic? - The Cinemaholic](https://thecinemaholic.com/was-paul-newman-really-an-alcohol...) [^4]: [Who Paul Newman Was—and Who He Wanted to Be](https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...) [^5]: [Paul Newman's Beer Drinking - Psychology Today](https://www.psychologytoday.com/us/blog/addiction-in-society...) [^6]: [Paul Newman's Beer Drinking | Psychology Today United Kingdom](https://www.psychologytoday.com/gb/blog/addiction-in-society...) [^7]: [The troubled life of Paul Newman | The Spectator](https://www.spectator.co.uk/article/the-troubled-life-of-pau...)
- billforsternz 10 months ago
  
  > "According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating."
  'though it had nothing to do with cheating' is a weird inclusion.
4b11b4 10 months ago
We all know this isn't the kind of question you ask an LLM...
- whutsurnaym 10 months ago
  
  You may know this, but that doesn't mean we all do. LLMs are marketed towards many different groups, including folks who don't have much tech knowledge.
whiddershins 10 months ago

"known to" !== "known for"
hn_throwaway_99 10 months ago
So, in other words, are you saying that AI model progress is the real deal and is not bullshit?
That is, as you point out, "all of the models up to o3-mini-high" give an incorrect answer, while other comments say that OpenAIs later models give correct answers, with web citations. So it would seem to follow that "recent AI model progress" actually made a verifiable improvement in this case.
- saurik 10 months ago
  
  I am pretty sure that they must have meant "up through", not "up to", as the answer from o3-mini-high is also wrong in a way which seems to fit the same description, no?
  
  1 reply →
fragmede 10 months ago

oh no. LLMs aren't up on the latest celebrity gossip. whatever shall we do.
mhh__ 10 months ago
Unless you're expecting an LLM to have access to literally all information on earth at all times I find it really hard to care about this particular type of complaint.
My calculator can't conjugate German verbs. That's fine IMO. It's just a tool
- twright0 10 months ago
  
  Your calculator won't confidently pretend to conjugate German verbs while doing so incorrectly, though.
- forgetfreeman 10 months ago
  
  Yes but a tool for what? When asked a question individuals that don't already have detailed knowledge of a topic are left with no way to tell if the AI generated response is complete bullshit, uselessly superficial, or detailed and on point. The only way to be sure is to then go do the standard search engine grovel looking for authoritative sources.

throwawayffffas 10 months ago

I agree, about both the issue with benchmarks not being relevant to actual use cases and the "wants to sound smart" issue. I have seen them both first hand interacting with llms.

I think the ability to embed arbitrary knowledge written in arbitrary formats is the most important thing llms have achieved.

In my experience trying to get an llm to perform a task as vast and open ended as the one the author describes is fundamentally misguided. The llms were not trained for that and won't be able to do it in a satisfactory degree. But all this research has thankfully provided us with the software and hardware tools where one could start working on training a model that can.

Contrast that to 5-6 years ago, when all you could hope for this kind of thing was simple rule based and pattern matching systems.

numa7numa7 10 months ago

My lived experience is that unless there's some new breakthrough's AI is more akin to a drill to replace a hammer than a tractor to replace the plow or a printing press.

Maybe any AI experts can elaborate on this but it seems there's a limit to the fundamental underlying model of the LLM architecture of transformers and tokens.

LLM's are amazing but we might need something more or some new paradigm to push us towards true AGI.

paulsutter 10 months ago

Im able to get substantially more coding done than three months ago. This could be largely in the tooling (coding agents, deep research). But the models are better too, for both coding and brainstorming. And tooling counts, to me, as progress.

Learning to harness current tools helps to harness future tools. Work on projects that will benefit from advancements, but can succeed without them.

dghlsakjg 10 months ago

I'm not sure if I'm able to do more of the hard stuff, but a lot of the easy but time consuming stuff is now easily done by LLMs.
Example: I frequently get requests for data from Customer Support that used to require 15 minutes of my time noodling around writing SQL queries. I can cut that down to less than a minute now.
mountainriver 10 months ago

Yes I am a better engineer with every release. I think this is mostly empirically validated

softwaredoug 10 months ago

I think the real meaningful progress is getting ChatGPT 3.5 level quality running anywhere you want rather than AIs getting smarter at high level tasks. This capability being ubiquitous and not tied to one vendor is really what’s revolutionary.

JKCalhoun 10 months ago

I've suggested (from my lowly layman vantage point) that an LLM has some idea of the fidelity of their response to a query — even if only in broad strokes like, "This answer is tracking with some high probability numbers" or "This answer has a very low correspondence with training data".

To that end the LLM could convey as much.

(Anecdotal, sorry:) I was using Claude (not paid) recently and noticed Claude hedging quite a bit when it had not before. Examples:

"Let me be careful about this response since we're discussing a very specific technical detail ..."

"Given how specific that technical detail is, I want to be transparent that while I aim to be accurate, I may hallucinate such precise historical specifications."

I confess my initial reaction was to ask ChatGPT since the answers are more self-assured, ha ha. So perhaps corporate AI are not likely to try and solve this problem of the LLM telling the user when it is on shaky ground. Bad for business.

djha-skin 10 months ago

> Since 3.5-sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.

This is likely a manifestation of the bitter lesson[1], specifically this part:

> The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project [like an incremental model update], massively more computation inevitably becomes available.

(Emphasis mine.)

Since the ultimate success strategy of the scruffies[2] or proponents of search and learning strategies in AI is Moore's Law, short term gains using these strategies will be miniscule. It is over at least a five year period that their gains will be felt the most. The neats win the day in the short term, but the hare in this race will ultimately give away to the steady plod of the tortoise.

1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

2: https://en.m.wikipedia.org/wiki/Neats_and_scruffies#CITEREFM...

billyp-rva 10 months ago

> [T]here are ~basically~ no public benchmarks for security research... nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.

A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to the codebase's strategy and purpose. They therefore struggle to produce something meaningful from it like a security assessment or a system diagram.

[0] https://www.ilograph.com/blog/posts/diagrams-ai-can-and-cann...

photochemsyn 10 months ago

Will LLMs end up like compilers? Compilers are also fundamentally important to modern industrial civilization - but they're not profit centers, they're mostly free and open-source outside a few niche areas. Knowing how to use a compiler effectively to write secure and performative software is still a valuable skill - and LLMs are a valuable tool that can help with that process, especially if the programmer is on the steep end of the learning curve - but it doesn't look like anything short of real AGI can do novel software creation without a human constantly in the loop. The same argument applies to new fundamental research, even to reviewing and analyzing new discoveries that aren't in the training corpus.

Wasn't it back in the 1980s that you had to pay $1000s for a good compiler? The entire LLM industry might just be following in the compiler's footsteps.

lukev 10 months ago
This seems like a probable end state, but we're going to have to stop calling LLMs "artificial intelligence" in order to get there.
- bcoates 10 months ago
  
  Yep. I'm looking forward to LLMs/deepnets being considered a standard GOFAI technique with uses and limitations and not "we asked the God we're building to draw us a picture of a gun and then it did and we got scared"
- mmcnl 10 months ago
  
  Why not? Objectively speaking LLMs are artificial intelligent. Just because it's not human level intelligence doesn't mean it's not intelligent.
  
  1 reply →

Mikhail_Edoshin 10 months ago

The foundational idea of AI is that people think with words. But it is false. People think with things and use words only to point to things. Thinking with words is not possible. It will always be a case of shizophasia.

abrichr 10 months ago

The "things" you mention may correspond to internal concept representations encoded in the model's weights. See e.g. https://arxiv.org/abs/2206.13289

Jonovono 10 months ago

Not sure if its been fixed by now but a few weeks ago I was in the Golden Gate park and wondered if it was bigger than Central park. I asked ChatGPT voice, and although it reported the sizes of the parks correctly (with Golden gate park being the bigger size), it then went and said that Central Park was bigger. I was confused, so Googled and sure enough Golden gate park is bigger.

I asked Grok and others as well. I believe Perplexity was the only one correct.

Repeated it multiple times even with a friends account. It kept doing the same thing. It knew the sizes, but thought the smaller sized one was bigger...

gpsx 10 months ago

I just tried. Claude did exactly what you said, and then figured it out:
Central Park in New York City is bigger than GoldenGate Park (which I think you might mean Golden Gate Park) in San Francisco.
Central Park covers approximately 843 acres (3.41 square kilometers), while Golden Gate Park spans about 1,017 acres (4.12 square kilometers). This means Golden Gate Park is actually about 20% larger than Central Park.
Both parks are iconic urban green spaces in major U.S. cities, but Golden Gate Park has the edge in terms of total area.
stitched2gethr 10 months ago

Probably because it has read the facts but has no idea how numbers actually work.

DisjointedHunt 10 months ago

Two things can be true at the same time:

1. Model "performance" judged by proxy metrics of intelligence have improved significantly over the past two years.

2. These capabilities are yet to be stitched together in the most appropriate manner for the cybersecurity scenarios the author is talking about.

In my experience, the best usage of Transformer models has come from a deep integration into an appropriate workflow. They do not (yet) replace the new exploration part of a workflow, but they are very scarily performant at following mid level reasoning assertions in a massively parallelized manner.

The question you should be asking yourself is if you can break down your task into however many small chunks that are constrained by feasiility in time to process , chunk those up into appropriate buckets or even better, place them in-order as though you were doing those steps with your expertise - an extension of self. Here's how the two approaches differ:

"Find vulnerabilities in this code" -> This will saturate across all models because the intent behind this mission is vast and loosely defined, while the outcome is expected to be narrow.

" (a)This piece of code should be doing x, what areas is it affecting, lets draw up a perimeter (b) Here is the dependency graph of things upstream and downstream of x, lets spawn a collection of thinking chains to evaluate each one for risk based on the most recent change . . . (b[n]) Where is this likely to fail (c) (Next step that a pentester/cybersecurity researcher would take) "

This has been trial and error in my experience but it has worked great in domains such as financial trading and decision support where experts in the field help sketch out the general framework of the process where reasoning support is needed and constantly iterate as though it is an extension of their selves.

dimal 10 months ago

It seems like the models are getting more reliable at the things they always could do, but they’re not showing any ability to move past that goalpost. Whereas in the past, they could occasionally write some very solid code, but often return nonsense, the nonsense is now getting adequately filtered by so-called “reasoning”, but I see no indication that they could do software design.

> how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy?

Why on earth do people want AI to do either of these things? As if our society isn’t fucked enough, having an untouchable oligarchy already managing companies and developing public policies, we want to have the oligarchy’s AI do this, so policy can get even more out of touch with the needs of common people? This should never come to pass. It’s like people read a pile of 90s cyberpunk dystopian novels and decided, “Yeah, let’s do that.” I think it’ll fail, but I don’t understand how anyone with less than 10 billion in assets would want this.

voidhorse 10 months ago

> Why on earth do people want AI to do either of these things?
This is the really important question, and the only answer I can drum up is that people have been fed a consistent diet of propaganda for decades centered around a message that ultimately boils down to a justification of oligarchy and the concentration of wealth. That and the consumer-focus facade makes people think the LLMS are technology for them—they aren't. As soon as these things get good enough business owners aren't going to expect workers to use them to be more productive, they are just going to fire workers and/or use the tooling as another mechanism by which to let wages stagnate.

timewizard 10 months ago

Government announces critical need to invest in AI and sets aside a bunch of money for this purpose.

Suddenly the benchmarks become detached from reality and vendors can claim whatever they want about their "new" products.

Just as a possible explanation, as I feel like I've seen this story before.

djoldman 10 months ago

LLM's are pre-trained to minimize perplexity (PPL), which essentially means that they're trained to model the likelihood distribution of the next words in a sequence.

The amazing thing was that minimizing PPL allowed you to essentially guide the LLM output and if you guided it in the right direction (asked it questions), it would answer them pretty well. Thus, LLMs started to get measured on how well they answered questions.

LLMs aren't trained from the beginning to answer questions or solve problems. They're trained to model word/token sequences.

If you want an LLM that's REALLY good at something specific like solving math problems or finding security bugs, you probably have to fine tune.

burny_tech 10 months ago

In practice, Sonnet 3.7 and Gemini 2.5 are just often too good compared to competitors.

karmakaze 10 months ago

> [...] But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. [...]

Seems like they're looking at how they fail and not considering how they're improving in how they succeed.

The efficiency in DeepSeek's Multi-Head Latent Attention[0] is pure advancement.

[0] https://youtu.be/0VLAoVGf_74?si=1YEIHST8yfl2qoGY&t=816

ohgr 10 months ago

It’s not even approaching the asymptotic line of promises made at any achievable rate for the amount of cash being thrown at it.

Where’s the business model? Suck investors dry at the start of a financial collapse? Yeah that’s going to end well…

maccard 10 months ago
> where’s the business model?
For who? Nvidia sell GPUs, OpenAI and co sell proprietary models and API access, and the startups resell GPT and Claude with custom prompts. Each one is hoping that the layer above has a breakthrough that makes their current spend viable.
If they do, then you don’t want to be left behind, because _everything_ changes. It probably won’t, but it might.
That’s the business model
- grey-area 10 months ago
  
  That’s not a business model, it’s a pipe dream.
  This bubble will be burst by the Trump tariffs and the end of the zirp era. When inflation and a recession hit together hope and dream business models and valuations no longer work.
  
  8 replies →
- ohgr 10 months ago
  
  You missed the end of the supply chain. Paying users. Who magically disappear below market sustaining levels of sales when asked to pay.
  
  7 replies →
reportgunner 10 months ago

Financial collapse ? Surely we can just roll out AI powered money printers and make them go BRRR /s

mmcnl 10 months ago

I feel we are already in the era of diminishing returns on LLM improvements. Newer models seem to be more sophisticated implementations of LLM technology + throwing more resources at it, but to me they do not seem fundamentally more intelligent.

I don't think this is a problem though. I think there's a lot of low-hanging fruit when you create sophisticated implementations of relatively dumb LLM models. But that sentiment doesn't generate a lot of clicks.

mentalgear 10 months ago

Who would assume that LLM companies were to hyper optimise on public to make their share prices go up and bubble keep afloat ... What a unserious thought to maintain ...

mellosouls 10 months ago

Original post here:

https://zeropath.com/blog/on-recent-ai-model-progress

StickyRibbs 10 months ago

There's the politics of the corporations and then there's the business of the science behind LLM's, this article feels like the former.

Maybe someone active in the research can comment? I feel like all of these comments are just conjecture/anecdotal and don't really get to the meat of this question of "progress" and the future of LLM's

slicerdicer2 10 months ago

more interesting benchmarks:

- Make as much money as you can in a 24 hour period doing only legal, moral and non-annoying things

- Make a depressed person who calls a suicide hotline feel really happy doing only legal, moral and non-annoying things

- Do something really useful for society with measurable outcomes doing only legal, moral and non-annoying things

bbstats 10 months ago

So I guess this was written pre-Gemini 2.5

stitched2gethr 10 months ago

Meh. I've been using 2.5 with Cline extensively and while it is better it's still an incremental improvement, not something revolutionary. The thing has a 1 million token context window but I can only get a few outputs before I have to tell it AGAIN to stop writing comments.
Are they getting better, definitely. Are we getting close to them performing unsupervised tasks, I don't think so.

crvdgc 10 months ago

> But in recent months I've spoken to other YC founders doing AI application startups [...] in different industries, on different problem sets.

Maybe they should create a benchmark collectively called YC founders. Gather various test cases. Never make it public. And use that to evaluate newly released models.

Snowfield9571 10 months ago

I first used GPT 2.5 many years ago through a google colab notebook. These newer models aren’t all that much better. I’m sure they do better on their tests because they’re trained to beat them but really the biggest “innovation” was a UI and API.

OtherShrezzing 10 months ago

Assuming that the models getting better at SWE benchmarks and math tests would translate into positive outcomes in all other domains could be an act of spectacular hubris by the big frontier labs, which themselves are chock-full of mathematicians and software engineers.

dcl 10 months ago

I like this bit:

> Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.

Definitely interested to see how the best models from Anthropics competitors do at this.,

mmcnl 10 months ago

delusional 10 months ago

> Sometimes the founder will apply a cope to the narrative ("We just don't have any PhD level questions to ask")

Please tell me this is not what tech-bros are going around telling each other! Are we implying that the problems in the world, the things that humans collectively work on to maintain the society that took us thousands of years to build up, just aren't hard enough to reach the limits of the AI.

Jesus Christ.

bcoates 10 months ago

I mean... most businesses, particularly small businesses and startups, aren't exactly doing brain surgery on a rocketship.
It’s pretty likely that they have extremely dull problems like "running an inbound call center is a lot of work" or "people keep having their mail stolen and/or lying that they did" that "more smarter gpus" won't solve

richardatlarge 10 months ago

I wonder if there’s a parallel universe somewhere where this same HN post has the exact same comments, only the posts are by NNs discussing human test subjects :)

boxed 10 months ago

> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. [...then says maybe not...]

Well.. they've been caught again and again red handed doing exactly this. Fool me once shame on you, fool me 100 times shame on me.

drdaeman 10 months ago

Hate to say this but the incentive is growth, not progress. Progress is what enabled the growth, but is also extremely hard to plan and deliver. On the other hand, hype is probably somewhat easier and well-tested approach so no surprise lot of the effort goes into marketing. Markets had repeatedly confirmed that there aren't any significant immediate repercussions for cranking up BS levels in marketing materials, while there are some rewards when it works.
smnplk 10 months ago

Fool me once, shame on you...If fooled, you cant get fooled again.
https://www.youtube.com/shorts/LmFN8iENTPc

andrethegiant 10 months ago

Author will need to rewrite now that Llama 4 is out

jaredcwhite 10 months ago

There's some interesting information and analysis to start off this essay, then it ends with:

"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."

This sounds like an article of faith to me. One could just as easily say they won't become the beating hearts of anything, and instead we'll choose to continue to build a better future for humans, as humans, without relying on an overly-hyped technology rife with error and unethical implications.

voidhorse 10 months ago

While I really want your version of the future, history and general trajectory and tendency unfortunately suggest otherwise.

lithos 10 months ago

Wait this sounds like one of those actors spamming and wasting open source time with fake security issues.

antirez 10 months ago

Missing in the article: "Disclaimer: This article was written before testing Gemini 2.5 PRO"

a3w 10 months ago

For three years now, my experience with LLMs has been "mostly useless, prefer ELIZA".

Which is software written 1966, but the web version is a little newer. Does occasional psychotherapy assistance/brainstorming just as well, and I more easily know when I stepped out of its known range into the extrapolated.

That said, it can vibe code in a framework unknown to me in half the time that I would need to school myself and add the feature.

Or vibe coding takes twice as long, if I mostly know how to achieve what I want and read no framework documentation but only our own project's source code to add a new feature. But on a day with a headache, I can still call the LLM a dumb twat and ask it to follow my instructions instead of doing bullshit.

But, vibe coding always makes my pulse go to 105, from 65 and question my life choices. Since few instructions are rarely ever followed and loops never left once entered. Except for on the first try getting 80% of the structure kinda right, but then getting stuck for the whole workday.

fennecfoxy 10 months ago

I mean: * I don't think we've seen any major release or new architectural changes in the major (large companies) models recently

* Model creation has exploded with people training their own models and fine tunes, etc but these are all derivatives of parent models from large companies

So I'm not really sure what they mean when they refer to "recent model progress"...I don't think anybody is putting out a llama finetune saying "this is revolutionary!111" nor have I seen OAI, et al make any such claims either.

Is the sensation just because forward momentum is stalling while we wait for the next big leap?

guluarte 10 months ago

current AI models are searching engines that can somewhat extrapolate data and use in the current context, that's pretty much it

bluepoint 10 months ago

Reading the comments, is it safe to say that LLMs are a digest of the internet which is some update over google search, but with the caveat that you need to double check the results? I mean they basically have some compressed version of almost all the written knowledge and will respond correctly about things that have already been written, and hallucinate (extrapolate) about things not explicitly written. Of course if someone carefully curates the input data to filter out misinformation, it might even be an upgrade over google. Is there a consensus on this?

dkersten 10 months ago

I honestly can’t notice any difference in outdoor quality between GPT 4o and GPT 4.5. I also can’t notice any difference in programming quality in cursor when using Claude 3.7 vs 3.5. I’m told there is a clear difference, but I don’t notice it.

Nesco 10 months ago

From my own experience on a codebase using a lot of custom algorithms on trees and sometimes graphs.

There were qualitatively leaps in my day-to-day usage:

Claude Sonnet 3.5 and ChatGPT O1 were good for writing slop and debugging simple bugs

Grok Thinking and Sonnet 3.7 were good to catch mildly complicated bugs and write functions with basic logic. They still made mistake

But recently, Gemini 2.5 pro has been scary good. I liked to made fun of the feel-the-AGI crowd but for the first time a model made me raise an eyebrow

It can one shot unusual function with complicated logic and subtle edge cases

conartist6 10 months ago

Oh yes, before we've designated all the messy parts of running a society to the machines, like, you know, art.

I'm this case the goal is to kill all the humans who know a lot about keeping other people safe...

colonial 10 months ago

> AI lab founders believe they are in a civilizational competition for control of the entire future lightcone, and will be made Dictator of the Universe if they succeed.

Bingo, but I'd argue this is only scratching the surface of how twisted things are.

A lot of the stuff these labs put out (see: Amodei's cult-like blog ramblings) reeks of what I call "sociopathic utopianism" - essentially, the logical extreme of ends-justified means, made worse in the context of AI labs by the singularity pseudo-religion.

They all truly believe that AGI/ASI is possible, imminent, and could lead to utopia... so achieving that goal will surely outweigh any unsavory acts they commit in the pursuit of it.

This is why I think it's possible OpenAI took out a hit on Suchir; getting bogged down in a legal battle could delay the arrival of their machine god messiah. Same for defrauding benchmarks - they just need a "few more rounds" of investor cash, and by the time those run out, they'll surely have AGI on lock!

Fools. I look forward to them all getting prison time.

fnordpiglet 10 months ago

… deleted … (Sorry the delete isn’t working, meant for another subthread)

yimby2001 10 months ago

[dead]

claytron5000 10 months ago

[dead]

dev1ycan 10 months ago

[flagged]

Claudus 10 months ago

I asked Grok to review the comments here and generate a response defending AI:

After reviewing the discussion on the Hacker News thread, it’s clear that there are a range of complaints and criticisms about AI, particularly centered around its limitations, overhype, and practical utility. Some users express frustration with AI’s inability to handle complex reasoning, its tendency to produce generic or incorrect outputs, and the perception that it’s more of a buzzword than a transformative tool. Others question its value compared to traditional methods or human expertise, suggesting it’s overhyped or misapplied in many cases. Below, I’ll offer a defense of AI that addresses these concerns while highlighting its strengths and potential.

AI isn’t perfect, and no one should claim it is—but that’s not the point. It’s a tool, and like any tool, its effectiveness depends on how it’s used. Critics who point to AI’s struggles with nuanced reasoning or edge cases often overlook the fact that it’s not designed to replace human judgment entirely. Instead, it excels at augmenting it. For example, AI can process vast amounts of data—far more than any human could in a reasonable timeframe—and identify patterns or insights that might otherwise go unnoticed. This makes it invaluable in fields like medicine, where it’s already helping diagnose diseases from imaging data, or in logistics, where it optimizes supply chains with precision that manual methods can’t match.

The complaint about generic or incorrect outputs, often dubbed “hallucinations,” is fair but misses the bigger picture. Yes, AI can churn out nonsense if pushed beyond its limits or fed poor data—but that’s a reflection of its current stage of development, not its ultimate potential. These systems are improving rapidly, with each iteration reducing errors and refining capabilities. More importantly, AI’s ability to generate starting points—like drafts, code snippets, or hypotheses—saves time and effort. It’s not about delivering flawless results every time; it’s about accelerating the process so humans can refine and perfect the output. A programmer tweaking AI-generated code is still faster than writing it from scratch.

As for the overhype, it’s true that the buzz can get out of hand—marketing teams love a shiny new toy. But beneath the noise, real progress is happening. AI’s contributions aren’t always flashy; they’re often mundane but critical, like automating repetitive tasks or enhancing search algorithms. The critics who say it’s just a “fancy autocomplete” underestimate how transformative that can be. Autocomplete might sound trivial until you realize it’s powering real-time language translation or helping scientists sift through research papers at scale. These aren’t sci-fi fantasies—they’re practical applications delivering value today.

Finally, the notion that AI can’t match human expertise in complex domains ignores its complementary role. It’s not here to outthink a seasoned expert but to amplify their reach. A lawyer using AI to review contracts doesn’t lose their skill—they gain efficiency. A researcher leveraging AI to analyze data doesn’t stop hypothesizing—they get to test more ideas. The technology’s strength lies in its ability to handle the grunt work, freeing humans to focus on creativity and judgment.

AI isn’t a silver bullet, and it’s not without flaws. But the criticisms often stem from inflated expectations or a failure to see its incremental, practical benefits. It’s not about replacing humans—it’s about making us better at what we do. The trajectory is clear: as it evolves, AI will continue to refine its capabilities and prove its worth, not as a standalone genius, but as a partner in progress.

cbmask 10 months ago

[dead]