Comment by troupo

17 hours ago

> So what do you think the difference is between humans and an agent in this respect?

Humans learn.

Agents regurgitate training data (and quality training data is increasingly hard to come by).

Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.

> Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop.

Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic curve is just as useless as no growth at all.

For a year now, all the amazing "breakthrough" models have shown comparatively little progress, to the point that all providers have been mercilessly cheating with their graphs and benchmarks.

> Where did I say that? I didn’t even mention money, just the broader resource term. A lot of businesses are mostly running experiments to see if the current set of tooling can match the marketing (or the hype). They’re not building datacenters or running AI labs. Such experiments can’t run forever.

I'm just going to ask that you read any of my other comments: this is not at all how coding agents work, and it seems to be the most common misunderstanding among HN users generally. It's tiring to keep refuting it. RL in verifiable domains does not work like this.
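
To make "RL in verifiable domains" concrete: the training signal comes from a hard verifier, not from retrieving memorized solutions. A minimal sketch, where policy_sample, run_tests, and update_policy are hypothetical stand-ins rather than any lab's actual API:

    def rlvr_step(policy_sample, run_tests, update_policy, tasks):
        """One step of RL with verifiable rewards: sample attempts,
        score them with a hard verifier, reinforce what passed."""
        batch = []
        for task in tasks:
            attempt = policy_sample(task)  # model writes code for the task
            reward = 1.0 if run_tests(task, attempt) else 0.0  # computed, not retrieved
            batch.append((task, attempt, reward))
        update_policy(batch)  # e.g. a policy-gradient update weighted by reward

Because the reward is computed rather than looked up, the model can improve on problems that never appear verbatim in its training data.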

> Humans learn.

Sigh, so do LLMs, in context.

> Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.

There are literally benchmarks on this all over the place; I'm sure you follow them.

> Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.

And yet it's not logarithmic? Consider the data flywheel, consistent algorithmic improvements, and synthetic data [basically: rejection sampling from a teacher model with a lot of test-time compute + high temperature].
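
For the bracketed aside, a minimal sketch of that rejection-sampling recipe; teacher_generate and verify are hypothetical stand-ins for a high-temperature sampling call and a domain verifier (unit tests, an exact-match answer checker):

    def distill_synthetic_data(prompts, teacher_generate, verify,
                               samples_per_prompt=64, temperature=1.0):
        """Spend teacher test-time compute: draw many diverse candidates
        per prompt and keep only the ones the verifier accepts."""
        kept = []
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                candidate = teacher_generate(prompt, temperature=temperature)
                if verify(prompt, candidate):  # reject everything that fails
                    kept.append((prompt, candidate))
        return kept  # verified (prompt, completion) pairs for training a student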

> For a year now all the amazing "breakthrough" models have been showing little progress (comparatively). To the point that all providers have been mercilessly cheating with their graphs and benchmarks.

Benchmaxxing is for sure a real thing, and even honest benchmarking is very difficult to do, but taking "all of the AI companies are just faking the performance data" as the story is tremendously wrong. Consider AIME 2025 performance (uncontaminated data), and the fact that companies have a _deep incentive_ to genuinely improve their models (and then, of course, market them as hard as possible; that's a given). People will experiment with different models, and no benchmaxxing is going to fool them for very long.

If you think Opus 4.6 compared to Sonnet 3.x is "little progress", I think we're beyond the point of logical argument.

  • Are you aware that LLMs are still the same autocomplete, just with different token decisions, more data, better pre- and post-training, and different settings? (Roughly the decoding loop sketched at the end of this comment.)

    We have all the data now.

    I don’t see where the huge leap is supposed to come from; as someone said earlier, they still make basic errors.

    Models got better through a bunch of soft tuning. Language and abstraction are not the same thing; there are a lot of very good speakers who are terrible at logic and abstraction.

    Thinking abstractly sometimes makes it necessary to leave language and draw; some people even code in another programming language to get it.

    We’ve seen it with the compiler project: it looks nice, but if you wanted to make a competitive compiler, you’d be about as well off starting fresh.
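
    For reference, the "autocomplete" loop in question is roughly the following sketch; model is a hypothetical stand-in that returns one score per vocabulary id, and temperature is exactly the kind of token decision being referred to:

        import math
        import random

        def sample_next_token(logits, temperature=0.8):
            """Softmax over temperature-scaled scores, then sample one token id.
            Lower temperature sharpens the distribution; higher flattens it."""
            scaled = [x / temperature for x in logits]
            m = max(scaled)  # subtract the max for numerical stability
            weights = [math.exp(x - m) for x in scaled]
            return random.choices(range(len(weights)), weights=weights)[0]

        def autocomplete(model, tokens, max_new_tokens=50):
            """The basic decoding loop: predict one token, append it, repeat."""
            for _ in range(max_new_tokens):
                tokens.append(sample_next_token(model(tokens)))
            return tokens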