Comment by fnordpiglet

5 days ago

Why is that less exciting? A machine competing in an unconstrained, natural-language, difficult math contest and coming out on top by any means was breathtaking science fiction a few years ago - now it’s not exciting? Regardless of the tools for verification or even solvers - why are the goalposts moving so fast? There is no bonus for “purity of essence” and using only neural networks. We live in an era where it’s hard to tell if machines are thinking or not, which ever since the first computing machines has been seen as the ultimate achievement. Now we pooh-pooh the results of each iteration - which now unfold month over month, not decade over decade.

You don’t have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques is used. In fact - you should be excited if we’ve started to break out of the limitations of forcing NNs to be load-bearing in literally everything. That’s a sign of a maturing technology, not of limitations.

>> Why is that less exciting? A machine competing in an unconstrained, natural-language, difficult math contest and coming out on top by any means was breathtaking science fiction a few years ago - now it’s not exciting?

Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.

  • I don't get it, how do you "big data cheat" an AI into solving previously unencountered problems? Wouldn't that just be engineering?

    • I don’t know how you’d cheat at it either, but if you could, it would manifest as the model getting gold on the test and then, six months later when it’s released to the public, exhibiting wild hallucinations and basic algebra errors. I don’t know if that’s how it’ll play out this time, but I know how it played out the last ten.

    • It depends on what you mean by "engineering". For example, "engineering" can mean that you train and fine-tune a machine learning system to beat a particular benchmark. That's fun times but not really interesting or informative.

    • > previously unencountered problems

      I haven't read the IMO problems, but knowing how math Olympiad problems work, they're probably not really "unencountered".

      People aren't inventing these problems ex nihilo, there's a rulebook somewhere out there to make life easier for contest organizers.

      People aren't doing these contests for money, they are doing them for honor, so there is little incentive to cheat. With big business LLM vendors it's a different situation entirely.

    • I mean, solutions for the 2025 IMO problems are already available on the internet. How can we be sure these are “unencountered” problems?

Without sharing their methodology, how can we trust the claim? Questions like:

1) did humans formalize the input? 2) did humans prompt the LLM towards the solution? etc.

I am excited to hear about it, but I remain skeptical.

>Why is that less exciting?

Because if I have to throw 10000 rocks to get one in the bucket, I am not as good/useful of a rock-into-bucket-thrower as someone who gets it in one shot.

People would probably not be as excited about the prospect of employing me to throw rocks for them.

  • If you don't have an automatic way to verify solutions, then picking the correct answer out of 10,000 is more impressive than coming up with some answer in the first place. If AI tech becomes able to effectively prune the search tree without an eval, that would be a super big leap, but I doubt they achieved this.

  • It’s exciting because nearly all humans have a 0% chance of throwing the rock into the bucket, and most people believed a rock-into-bucket-throwing machine was impossible. So even an inefficient rock-into-bucket-thrower is impressive.

    But the bar has been getting raised very rapidly. What was impressive six months ago is awful and unexciting today.

    • You're putting words in my mouth. It's not "awful and unexciting"; it is certainly an important step, but the hype being invited with the headline is the immensely greater one of an accurate rock-thrower. And if they have the inefficient one and are trying to pretend to have the real deal, that's flim-flam-man levels of overstatement.

I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason; they pattern-match language tokens and generate emergent behaviour as a result.

Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.

This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software that is sound by default but otherwise a moron: it requires very precise inputs, parameters, and testing to act correctly. System 2 thinking.

Now with NNs it's inverted: it's a brilliant know-it-all, but it bullshits a lot and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking, with questionable but evolving System 2 skills whose limits we don't know.

If you're not familiar with System 1 / System 2, it's googlable.

  • > These models cannot reason

    Not trying to be a smarty pants here, but what do we mean by "reason"?

    Just to make the point, I'm using Claude to help me code right now. In between prompts, I read HN.

    It does things for me such as coding up new features, looking at the compile-time and runtime output, and then correcting the code. All while I sit here and write with you on HN.

    It gives me feedback like "lock-free message passing is going to work better here" and then replaces the locks with the exact kind of thing I actually want. If it runs into a problem, it does what I did a few weeks ago: it sees that some flag is set wrong, or that some architectural decision needs to be changed, and then implements the changes.

    What is not reasoning about this? Last year at this time, if I looked at my code with a two hour delta, and someone had pushed edits that were able to compile, with real improvements, I would not have any doubt that there was a reasoning, intelligent person who had spent years learning how this worked.

    Is it pattern matching? Of course. But why is that not reasoning? Is there some sort of emergent behavior? Also yes. But what is not reasoning about that?

    I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.

    • I think the biggest hint that the models aren't reasoning is that they can't explain their reasoning. Researchers have shown, for example, that how a model actually solves a simple math problem and how it claims to have solved it after the fact have no real correlation. In other words, there was only the appearance of reasoning.

    • > I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.

      I’m using Opus 4 for coding, and there is no way that model demonstrates any reasoning or any “intelligence”, in my opinion. I’ve been through the “having conversations” phase, etc., but it doesn’t get you very far; better to read a book.

      I use these models to help me type less now; that’s it. My prompts basically tell it not to do anything fancy, and that works well.

    • You raise a fair point. These criticisms based on "it's merely X" or "it's not really Y" don't hold water when X and Y are poorly defined.

      The only thing that should matter is the results they get. And I have a hard time understanding why the thing that is supposed to behave in an intelligent way but often just spews nonsense gets 10x budget increases over and over again.

      This is bad software. It does not do the thing it promises to do. Software that sometimes works and very often produces wrong or nonsensical output is garbage software. Sinking 10x, 100x, 1000x more resources into it is irrational.

      Nothing else matters. Maybe it reasons, maybe it's intelligent. If it produces garbled nonsense often, giving the teams behind it 10x the compute is insane.

  • >I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result

    This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.

Because the usefulness of an AI model is reliably solving a problem, not being able to solve a problem given 10,000 tries.

Claude Code is still only a mildly useful tool because it's horrific beyond a certain breadth of scope. If I asked it to solve the same problem 10,000 times, I'm sure I'd get a great answer to significantly more difficult problems, but that doesn't help me, as I'm not capable of scaling myself to checking 10,000 answers.
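
To make the "one good answer out of 10,000" point concrete, here is a minimal toy sketch in Python (every name and number in it is a hypothetical stand-in, not anything from the announcement): best-of-N sampling only pays off when there is a cheap, automatic verify() step, because otherwise someone still has to read all 10,000 candidates.

    import random

    # Toy model of best-of-N sampling. All names and numbers here are
    # hypothetical stand-ins, not anyone's actual methodology.

    P_CORRECT = 0.001  # pretend each independent sample is right 0.1% of the time

    def generate_candidate(problem: str) -> dict:
        """Pretend to sample one solution attempt from a model."""
        return {"problem": problem, "correct": random.random() < P_CORRECT}

    def verify(candidate: dict) -> bool:
        """A cheap automatic checker, e.g. a proof checker or a test suite.

        This oracle does the heavy lifting: without it, a human would have
        to read every candidate to find the good one.
        """
        return candidate["correct"]

    def best_of_n(problem: str, n: int = 10_000):
        """Sample up to n candidates and return the first one that verifies."""
        for _ in range(n):
            candidate = generate_candidate(problem)
            if verify(candidate):
                return candidate
        return None  # with no verifier, "best of 10,000" is just 10,000 guesses

    if __name__ == "__main__":
        hit = best_of_n("toy problem")
        print("found a verified answer" if hit else "no verified answer")

With these made-up numbers, the chance of at least one correct sample in 10,000 tries is 1 - 0.999^10000 ≈ 99.995%, so the sampling side is the easy part; whether verify() can actually be automated (it can for formal proofs, mostly not for free-form natural-language ones) is exactly where the disagreement in this thread sits.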