Comment by user34283

1 month ago

I find it hard to believe after running agents fully autonomously for a week you'd end up with something that actually compiles and at least somewhat functions.

And I'm an optimist, not one of the AI skeptics heavily present on HN.

From the post it sounds like the author would also doubt this when he talks about "glorified autocomplete and refactoring assistants".

10 comments

user34283

simonw 1 month ago

You don't run coding agents for a week and THEN compile their code. The best available models would have no chance of that working - you're effectively asking them to one-shot a million lines of code with not a single mistake.

You have the agents compile the code every single step of the way, which is what this project did.

user34283 1 month ago

With the agent running autonomously for a long time, I'd have feared it would break my build/verification tasks in an attempt to fix something.
My confidence in running an agent unsupervised for a long time is low, but to be fair that's not something I tried. I worked mostly with the agent in the foreground, at most I had two agents running at once in Antigravity.

Veserv 1 month ago

It did not compile [1], so your belief was correct.

[1] https://news.ycombinator.com/item?id=46649046

simonw 1 month ago
It did compile - the coding agents were compiling it constantly.
It didn't have correctly configured GitHub Actions so the CI build was broken.
- Veserv 1 month ago
  
  Then you should have no difficulty providing evidence for your claim. Since you have been engaging in language lawyering in this thread, it is only fair your evidence be held up to the same standard and must be incontrovertible evidence for your claims with zero wiggle room.
  Even though I have no burden of proof to debunk your claims as you have provided no evidence for your claims, I will point out that another commenter [1] indicates there were build errors. And the developer agrees there were build errors [2] that they resolved.
  [1] https://news.ycombinator.com/item?id=46650998
  
  4 replies →

santadays 1 month ago

That is a good point. It is impressive. Llms from two years ago were impressive, llms a year ago were impressive, and from a month ago even more impressive.

Still, getting "something" to compile after a week of work is very different from getting the thing you wanted.

What is being sold, and invested in, is the promise that LLMs can accomplish "large things" unaided.

But they can't, as of yet, they cannot, unless something is happening in one of the SOTA labs that we don't know about.

They can however accomplish small things unaided. However there is an upper bound, at least functionally.

I just wish everyone was on the same page about their abilities and their limitations.

To me they understand conext well (e.g. the task, build a browser doesn't need some huge specification because specifications already exist).

They can write code competently (this is my experience anyway)

They can accomplish small tasks (my experience again, "small" is a really loose definition I know)

They cannot understand context that doesn't exist (they can't magically know what you mean, but they can bring to bear considerable knowledge of pre-existing work and conventions that helps them make good assumptions and the agentic loop prompts them to ask for clarification when needed)

They cannot accomplish large tasks (again my experience)

It seems to me there is something akin to the context window into which a task can fit. They have this compact feature which I suspect is where this limitation lies. Ie a person can't hold an entire browser codebase in their head, but they can create a general top level mapping of the whole thing so they can know where to reach, where areas of improvement are necessary, how things fit together and what has been and what hasn't been implemented. I suspect this compaction doesn't work super well for agents because it is a best effort tacked on feature.

I say all this speculatively, and I am genuinely interested in whether this next level of capability is possible. To me it could go either way.