Comment by btown

4 days ago

It bears repeating that modern LLMs are incredibly capable, and relentless, at solving problems that have a verification test suite. It seems like this problem did (at least for some finite subset of n)!

This result, by itself, does not generalize to open-ended problems, though, whether in business or in research in general. Discovering the specification to build is often the majority of the battle. LLMs aren't bad at this, per se, but they're nowhere near as reliably groundbreaking as they are on verifiable problems.

> modern LLMs are incredibly capable, and relentless, at solving problems that have a verification test suite.

Feels like this is a bit of what I tried to express a few weeks ago https://news.ycombinator.com/item?id=46791642 namely that we are just pouring computational resources at verifiable problems and then claiming that, astonishingly, it sometimes works. Sure, LLMs even have a slight bias, namely they rely on statistics, so it's not purely brute force, but the approach is still pretty much the same: throw stuff at the wall, see what sticks, and once something finally does, report it as grandiose and claim to be "intelligent".
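
A minimal sketch of that loop, assuming a hypothetical llm_generate() stand-in for sampling from a model and a test suite as the verifier (nothing here is any lab's actual pipeline):

    import subprocess
    import tempfile

    def llm_generate(prompt: str, temperature: float) -> str:
        """Hypothetical stand-in: sample one candidate program from an LLM."""
        raise NotImplementedError

    def passes_tests(candidate_src: str, test_cmd: list[str]) -> bool:
        """The verifier: write the candidate to disk and run the tests on it."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate_src)
            path = f.name
        return subprocess.run(test_cmd + [path], capture_output=True).returncode == 0

    def search(prompt: str, test_cmd: list[str], budget: int) -> str | None:
        # "Throw stuff at the wall": keep sampling until the verifier accepts
        # or the compute budget runs out.
        for _ in range(budget):
            candidate = llm_generate(prompt, temperature=0.8)
            if passes_tests(candidate, test_cmd):
                return candidate  # "see what sticks"
        return None

The verifier is doing all the epistemic work here; the model only proposes.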

  • > throw stuff at the wall, see what sticks, and once something finally does, report it as grandiose and claim to be "intelligent".

    What do we think humans are doing? I think it's not unfair to say our minds are constantly trying to assemble the pieces available to them in various ways, whether we're actively thinking about a problem or in the background as we go about our day.

    Every once in a while the pieces fit together in an interesting way and it feels like inspiration.

    The techniques we’ve learned likely influence the strategies we attempt, but beyond all this what else could there be but brute force when it comes to “novel” insights?

    If it’s just a matter of following a predefined formula, it’s not intelligence.

    If it’s a matter of assembling these formulas and strategies in an interesting way, again what else do we have but brute force?

    • See what I replied just earlier https://news.ycombinator.com/item?id=47011884 about the different regimes: working within a paradigm versus challenging it by going back to first principles. The ability to notice that something is off beyond "just" assembling existing pieces, to backtrack within the process when failures pile up, and to actually understand the relationship is precisely what's different.

    • While I don't think anyone has a plausible theory that goes to this level of detail on how humans actually think, there's still a major difference. I think it's fair to say that if we are doing a brute force search, we are still astonishingly more energy efficient at it than these LLMs. The amount of energy that goes into running an LLM for 12h straight is vastly higher than what it takes for humans to think about similar problems.
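
      A rough back-of-envelope, with assumed round numbers (≈20 W is a standard estimate for the human brain; ≈700 W is the TDP of a single modern datacenter GPU, and real inference runs span many of them):

          BRAIN_W = 20    # typical estimate of human brain power draw, in watts
          GPU_W = 700     # assumed TDP of one datacenter GPU, in watts
          HOURS = 12

          human_kwh = BRAIN_W * HOURS / 1000    # ~0.24 kWh
          per_gpu_kwh = GPU_W * HOURS / 1000    # ~8.4 kWh, ~35x the brain, per GPU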

    • The field of medicine - pharmacology and drug discovery - is an optimized version of that. It works a bit like this:

      Instead of brute-forcing with infinite options, reduce the problem space by starting with some hunch about the mechanism. Then the hard part that can take decades: synthesize compounds with the necessary traits to alter the mechanism in a favourable way, while minimizing unintended side-effects.

      Then try on a live or lab grown specimen and note effectiveness. Repeat the cycle, and with every success, push to more realistic forms of testing until it reaches human trials.

      Many drugs that reach the last stage - human trials - end up being used for something completely different from what they were designed for! One example is minoxidil - designed to regulate blood pressure, used for regrowing hair!
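
      As a toy sketch of that loop (every name, threshold, and scoring rule below is made up, purely to show the shape of hunch-narrowed brute force with escalating tests):

          from dataclasses import dataclass
          import random

          @dataclass
          class Compound:
              efficacy: float      # how well it alters the target mechanism
              side_effects: float  # unintended off-target activity

          STAGES = ["in vitro", "animal model", "human trial"]  # escalating realism

          def synthesize(hunch_bias: float) -> Compound:
              # The hunch about the mechanism narrows the space: candidates are
              # biased toward the suspected target rather than drawn uniformly.
              return Compound(efficacy=random.random() * hunch_bias,
                              side_effects=random.random())

          def survives(c: Compound, stage: int) -> bool:
              # Each stage re-measures under fresh noise and a stricter bar,
              # which is why most candidates wash out along the way.
              measured = c.efficacy + random.gauss(0, 0.1)
              return measured > 0.5 + 0.1 * stage and c.side_effects < 0.3

          def discovery_loop(hunch_bias: float, budget: int) -> Compound | None:
              for _ in range(budget):
                  candidate = synthesize(hunch_bias)
                  if all(survives(candidate, s) for s in range(len(STAGES))):
                      return candidate
              return None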

  • That's also what most grad students are doing. Even in the unlikely case LLMs completely stop improving, it's still a massive deal.

    • Once heard someone call it "graduate student descent" and I've never heard a more apt term!

Yes, this is where I just cannot imagine completely AI-driven software development of anything novel and complicated without extensive human input. I'm currently working in a space where none of our data models are particularly complex, but the trick is all in defining the rules for how things should work.

Our actual software implementation is usually pretty simple; often writing up the design spec takes significantly longer than building the software, because the software isn't the hard part - the requirements are. I suspect the same folks who are terrible at describing their problems are going to need help from expert folks who are somewhere between SWE, product manager, and interaction designer.

Even more generally than verification, what matters is being tied to a loss function that represents something we actually care about: e.g. compiler and test errors, Lean verification in Aristotle, basic physics energy configurations in AlphaFold, or win conditions in RL, as in AlphaGo.

RLHF is an attempt to push LLMs pre-trained with a dopey reconstruction loss toward something we actually care about: imagine if we could find a pre-training criterion that actually cared about truth and/or plausibility in the first place!
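
The common shape of all of those signals, sketched for the code-generation case (the helpers and commands below are illustrative, not how any of the named systems actually score things):

    import subprocess

    def verifiable_reward(candidate_path: str, test_files: list[str]) -> float:
        """Reward in [0, 1]: did it compile, and what fraction of test files pass?

        Assumes the test files import the candidate module; pytest and
        py_compile stand in for whatever verifier the domain provides
        (a proof checker, an energy term, a win condition).
        """
        compiles = subprocess.run(
            ["python", "-m", "py_compile", candidate_path],
            capture_output=True).returncode == 0
        if not compiles:
            return 0.0
        passed = sum(
            subprocess.run(["pytest", "-q", t], capture_output=True).returncode == 0
            for t in test_files)
        # Compiling at all earns the floor; passing tests earns the rest.
        return 0.5 + 0.5 * passed / len(test_files)

Everything on that list reduces to some function like this: cheap to evaluate and hard to argue with, which is exactly what RLHF can only approximate with human raters.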