Scaling long-running autonomous coding

20 days ago (simonwillison.net)

Related: Scaling long-running autonomous coding - https://news.ycombinator.com/item?id=46624541 - Jan 2026 (187 comments)

111 comments

srameshc

light_hue_1 19 days ago

Browsers are pretty much the best case scenario for autonomous coding agents. A totally unique situation that mostly doesn't occur in the real world.

At a minimum:

1. You've got an incredibly clearly defined problem at the high level.

2. Extremely thorough tests for every part that build up in complexity.

3. Libraries, APIs, and tooling that are all compatible with one another because all of these technologies are built to work together already.

4. It's inherently a soft problem, you can make partial progress on it.

5. There's a reference implementation you can compare against.

6. You've got extremely detailed documentation and design docs.

7. It's a problem that inherently decomposes into separate components in a clear way.

8. The models are already trained not just on examples for every module, but on example browsers as a whole.

9. The done condition for this isn't a working browser, it's displaying something.

This isn't a realistic setup for anything that 99.99% of people work on. It's not even a realistic setup for what actual developers of browsers do who must implement new or fuzzy things that aren't in the specs.

Note 9. That's critical. Getting to the point where you can show simple pages is one thing. Getting to the point where you have a working production browser engine, that's not just 80% more work, it's probably considerably more than 100x more work.

maleldil 18 days ago

It's a good benchmark for how agents can write very complex code. Browsers are likely among the most complex programs we have today (arguably more complex than many OSs). Even if the problem is well-defined, many sceptics would still say the complexity is beyond what agents can handle.
polyglotfacto 18 days ago

So first of all, as per my other comments on this threads and coming from a browser engineer: the autonomous coding agents failed miserably.
Whether it is the best case scenario in terms of benchmark, I am not so sure.
The Web is indeed standardized and there are many open-source implementations out there. But how to implement the Web in a novel way by definition means you are trying to solve some perceived problem with existing implementations.
So I would rephrase your statement as such: rewriting an existing engine in another language without any novelty might be the best case scenario for autonomous coding agents.
As an example of approaching the problem in a novel way: the Fastrender code seems obsessed with metering of resources. Implementing the Web with that constraint in mind would be an interesting problem and not obvious at all. That's not what the project is doing so far by the way, since the code is quite frankly a bunch of spaghetti that does not follow Web standards at all(in a way that is unrelated to the metering story, so the divergence from specs is not novel, it's just wrong).

simonw 20 days ago

One of the big open questions for me right now concerns how library dependencies are used.

Most of the big ones are things like skia, harfbuzz, wgpu - all totally reasonable IMO.

The two that stand out for me as more notable are html5ever for parsing HTML and taffy for handling CSS grids and flexbox - that's vendored with an explanation of some minor changes here: https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...

Taffy a solid library choice, but it's probably the most robust ammunition for anyone who wants to argue that this shouldn't count as a "from scratch" rendering engine.

I don't think it detracts much if at all from FastRender as an example of what an army of coding agents can help a single engineer achieve in a few weeks of work.

sealeck 20 days ago
I think the other question is how far away this is from a "working" browser. It isn't impossible to render a meaningful subset of HTML (especially when you use external libraries to handle a lot of this). The real difficulty is doing this (a) quickly, (b) correctly and (c) securely. All of those are very hard problems, and also quite tricky to verify.
I think this kind of approach is interesting, but it's a bit sad that Cursor didn't discuss how they close the feedback loop: testing/verification. As generating code becomes cheaper, I think effort will shift to how we can more cheaply and reliably determine whether an arbitrary piece of code meets a desired specification. For example did they use https://web-platform-tests.org/, fuzz testing (e.g. feed in random webpages and inform the LLM when the fuzzer finds crashes), etc? I would imagine truly scaling long-running autonomous coding would have an emphasis on this.
Of course Cursor may well have done this, but it wasn't super deeply discussed in their blog post.
I really enjoy reading your blog and it would be super cool to see you look at approaches people have to ensuring that LLM-produced code is reliable/correct.
- simonw 20 days ago
  
  Yeah, I'm hoping they publish a lot more about this project! It deserves way more then the few sentences they've shared about it so far.
  
  1 reply →
- polyglotfacto 18 days ago
  
  I think the current approach is simply not scalable to a working browser ever.
  To leverage AI to build a working browser you would imo need the following:
  - A team of humans with some good ideas on how to improve on existing web engines.
  - A clear architectural story written not by agents but by humans. Architecture does not mean high-level diagrams only. At each level of abstraction, you need humans to decide what makes sense and only use the agent to bang out slight variations.
  - A modular and human-overseen agentic loop approach: one agent can keep running to try to fix a specific CSS feature(like grid), with a human expert reviewing the work at some interval(not sure how fine-grained it should be). This is actually very similar to running an open-source project: you have code owners and a modular review process, not just an army of contributor committing whatever they want. And a "judge agent" is not the same thing as a human code owner as reviewer.
  Example on how not to do it: https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...
  This rendering loop architecture makes zero sense, and it does not implement web standards.
  > in the HTML Standard, requestAnimationFrame is part of the frame rendering steps (“update the rendering”), which occur after running a task and performing a microtask checkpoint
  > requestAnimationFrame callbacks run on the frame schedule, not as normal tasks.
  This is BS: "update the rendering" is specified as just another task, which means it needs to be followed by a microtask checkpoint. See https://html.spec.whatwg.org/multipage/#event-loop-processin...
  Following the spec doesn't mean you cannot optimize rendering tasks in some way vs other tasks in your implementation, but the above is not that, it's classic AI bs.
  Understanding Web standards and translating them into an implementation requires human judgement.
  Don't use an agent to draft your architecture; an expert in web standards with a interest in agentic coding is what is required.
  Message to Cursor CEO: next time, instead of lighting up those millions on fire, reach out to me first: https://github.com/gterzian
  
  4 replies →
mwcampbell 19 days ago

I was gratified to learn that the project used my own AccessKit for accessibility (or at least attempted to; I haven't verified if it actually works at all; I doubt it)... then horrified to learn that it used a version that's over 2 years old.
embedding-shape 19 days ago

For me, the biggest open question is currently "How autonomous is 'autonomous'?" because the commits make it clear there were multiple actors involved in contributing to the repository, and the timing/merges make it seem like a human might have been involved with choosing what to merge (but hard to know 100%) and also making smaller commits of their own. I'm really curious to understand what exactly "It ran uninterrupted for one week" means, which was one of Cursor's claims.
I've reached out to the engineer who seemed to have run the experiment, who hopefully can shed some more light on it and (hopefully) my update to https://news.ycombinator.com/item?id=46646777 will include the replies and more investigations.
shubhamjain 20 days ago
Why attempt something that has abundant number of libraries to pick and choose? To me, however impressive it is, 'browser build from scratch' simply overstates it. Why not attempt something like a 3D game where it's hard to find open source code to use?
- Banditoz 20 days ago
  
  Is something like a 3D game engine even hard to find source code for? There's gotta lots of examples/implementations scattered around.
- cheevly 20 days ago
  
  Assets are very hard to produce and largely unsolved by AI at the moment.
  
  3 replies →
- XenophileJKO 19 days ago
  
  There are a lot of examples out there. Funny that you mention this. I literally just last night started a "play" project having Claude Code build a 3D web assembly/webgl game using no frameworka. It did it, but it isn't fun yet.
  I think the current models are at a capability level that could create a decent 3D game. The challenges are creating graphic assets and debugging/Qa. The debugging problem is you need to figure out a good harness to let the model understand when something is working, or how it is failing.
- fulafel 19 days ago
  
  There's many open source ones around.
  Also graphics acceleration makes it hard to do from scratch rather than using using the 3D APIs but I guess you could in principle go bare iron on hardware that has published specs such as AMD, or just do software only rendering.
janoelze 20 days ago
Any views on the nature of "maintainability" shifting now? If a fleet of agents demonstrated the ability to bootstrap a project like that, would that be enough indication to you that orchestration would be able to carry the code base forward? I've seen fully llm'd codebases hit a certain critical weight where agents struggled to maintain coherent feature development, keeping patterns aligned, as well as spiralling into quick fixes.
- simonw 20 days ago
  
  Almost no idea at all. Coding agents are messing with all 25+ years of my existing intuitions about what features cost to build and maintain.
  Features that I'd normally never have considered building because they weren't worth the added time and complexity are now just a few well-structured prompts away.
  But how much will it cost to maintain those features in the future? So far the answer appears to be a whole lot less than I would previously budget for, but I don't have any code more than a few months old that was built ~100% by coding agents, so it's way too early to judge how maintenance is going to work over a longer time period.
  
  2 replies →
- brianjeong 20 days ago
  
  I think there's a somewhat valid perspective that the Nth+1 model can simply clean up the previous models mess.
  Essentially a bet that the rate of model improvement is going to be faster than the rate of decay from bad coding.
  Now this hurts me personally to see as someone who actually enjoys having quality code but I don't see why it doesn't have a decent chance of holding
- Deevian 19 days ago
  
  They demonstrated the ability to bootstrap... "something". There's no maintainability to the output of the experiment.
teaearlgraycold 20 days ago
It looks like JS execution is outsourced to QuickJS?
- simonw 19 days ago
  
  No, it has its own JS implementation: https://news.ycombinator.com/item?id=46650998

tabs_or_spaces 19 days ago

> I think somebody will have built a full web browser mostly using AI assistance, and it won’t even be surprising

> When I made my 2029 prediction this is more-or-less the quality of result I had in mind.

There seems to be a lot of compensation and leniency made by the author here.

So, it is seemingly impressive that someone was able to use agents to build a browser.

But they used trillions of tokens? This equates to millions of dollars of spend. Are we really happy with this?

The browser itself is not fully complete. There's rendering glitches stated in the article. So millions of dollars for something that has obvious bugs.

This is also pure agent code. Can a code base like this ever be maintained by a team of humans? Are you vendor locked into a specific model if you want to build more features? How will support work? How will releases work? The lack of reflection over the rest of the software lifecycle except building is shocking.

So I'm not sure after reflecting, whether any of this is impressive outside of "someone with unlimited tokens built a browser using ai agents". It's the same class of problem being solved over and over again. Nothing new is really being done here.

Maybe it's just me but there's much more to software than just building.

simianwords 18 days ago
>But they used trillions of tokens? This equates to millions of dollars of spend. Are we really happy with this?
Yes, arguably 5 million is a fair price and cheaper than what it would take to pay humans.
- m4nu3l 18 days ago
  
  There is a problem with this comparison. The agent had access to open-source browsers in its training set. So you'd need to compare the cost of creating an equivalent browser for a developer who has access to them, too. If all you need is standard browser functionality, you just use an existing browser. If you want to change some features or parts of the implementation, you fork it. A new browser written from scratch would be valuable if it had a novel implementation that resulted in a faster/more secure/robust/memory efficient or simply easier-to-use browser. So even if this had implemented the standard correctly, it wouldn't be worth more than the time it takes a developer to fork Chromium and change its name. Don't get me wrong, it's impressive, but not as impressive after you think that an LLM that regurgitates verbatim the code of Chromium when tasked to build a browser would have effectively succeeded at the task.
  EDIT: About the rendering speed. It doesn't really make sense to compare it with a fully functioning browser, as you could potentially drop features or make bogus optimisations to go faster.
- polyglotfacto 18 days ago
  
  If you paid 5 cents for the code you would have been ripped off; it's throw-away stuff.
laterium 19 days ago
If an AI system autonomously built a rocket and went to the moon, would you call it unimpressive because it's already been done? The moving of goalposts is shocking.
- polyglotfacto 18 days ago
  
  As I explained elsewhere in this thread, the results here are more like trying to launch a rocket to the moon, unleashing AI on the problem, and settling for some kind of giant firecracker as a POC.
  This isn't a POC web engine; it's throw-away code that can never scale to a full web engine.
  So instead of wasting millions on this autonomous run, they should have put together a small team of people with some ideas on how to improve on existing web engines, and then give that team a large token development budget. You could get a nice POC after a couple of weeks, and after a year or two of further iterations you might have something really interesting.
  So this is a great example of how AI fails when left unsupervised; a more interesting experiment would be about how a small team can leverage AI to leapfrog Chromium; not in one week but in a year or two.

andrewchambers 19 days ago

Test suites just increased in value by a lot and code decreased in value.

utopiah 19 days ago
Doubt it, code will be generated to pass tests, not the intent behind the tests.
- daxfohl 19 days ago
  
  A million times, this. Sometimes they luck into the intent, but much more frequently they end up in a ball of mud that just happens to pass the tests.
  "8 unit tests? Great, I'll code up 8 branches so all your tests pass!" Of course that neglects the fact that there's now actually 2^8 paths through your code.
- Art9681 19 days ago
  
  What makes you think the next generation models won't be explicitly trained to prevent this, or any other pitfall or best practice as the low hanging fruit fall one by one?
- andrewchambers 19 days ago
  
  I think we agree - getting the llms to understand your intent is the hard part, at the very least you need well specified tests.
  Perhaps more advanced llms + specifications + better tests.
- krashidov 19 days ago
  
  if you can steer an LLM to write an application based on what you want, you can steer an LLM to write the tests you want. Some people will be better at getting the LLM to write tests, but it's only going to get easier and easier
__mharrison__ 19 days ago

This is one of the reasons why I just wrote a testing book (beta reviews giving feedback now). Testing is one of those boring subjects that many programmers ignore. But it just got very relevant. Especially TDD.
well_ackshually 19 days ago
No, OP is merely an AI deepthroater that will blindly swallow whatever drivel is put out by AI companies and then "benchmark" it by having it generate a pelican (oh and he got early access to the model), then call whatever he puts out "AI optimism"
The reality of things is, AI still can't handle long running tasks without blowing $500k worth of tokens for an end result that doesn't work, and further work is another $100k worth to get nothing novel.
- Xmd5a 19 days ago
  
  Where are you pulling these numbers from? I'm genuinely interested. Is it the kind of budget you need to spend in order to have Claude build a Word clone?

retinaros 20 days ago

Agentic coding is a card castle built on another card castle (test time compute) built on another card castle (token prediction) the mere fact that using lot of iterations and compute works maybe tells us that nothing is really elegant about the things we craft.

halfcat 20 days ago

So AI makes it cheaper to remix anything already-seen, or anything with a stable pattern, if you’re willing to throw enough resources at it.

AI makes it cheap (eventually almost free) to traverse the already-discovered and reach the edge of uncharted territory. If we think of a sphere, where we start at the center, and the surface is the edge of uncharted territory, then AI lets you move instantly to the surface.

If anything solved becomes cheap to re-instantiate, does R&D reach a point where it can’t ever pay off? Why would one pay for the long-researched thing when they can get it for free tomorrow? There will be some value in having it today, just like having knowledge about a stock today is more valuable than the same knowledge learned tomorrow. But does value itself go away for anything digital, and only remain for anything non-copyable?

The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?

tornikeo 20 days ago

> The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?
It's nearly frictionless, not frictionless because someone has to use the output (or at least verify it works). Also, why do you think the "shape" of the knowledge is spherical? I don't assume to know the shape but whatever it is, it has to be a fractal-like, branching, repeating pattern.
ramraj07 20 days ago
The fundamental idea that modern LLMs can only ever remix, even if its technically true (doubt), in my opinion only says to me that all knowledge is only ever a remix, perhaps even mathematically so. Anyone who still keeps implying these are statistical parrots or whatever is just going to regret these decisions in the future.
- omnicognate 19 days ago
  
  Why doubt? Transformers are a form of kernel smoothing [1]. It's literally interpolation [2]. That doesn't mean it can only echo the exact items in its training data - generating new data items is the entire point of interpolation - but it does mean it's "remixing" (literally forming a weighted sum of) those items and we would expect it to lose fidelity when moving outside the area covered by those points - i.e. where it attempts to extrapolate. And indeed we do see that, and for some reason we call it "hallucinating".
  The subsequent argument that "LLMs only remix" => "all knowledge is a remix" seems absurd, and I'm surprised to have seen it now more than once here. Humanity didn't get from discovering fire to launching the JWST solely by remixing existing knowledge.
  [1] http://bactra.org/notebooks/nn-attention-and-transformers.ht...
  [2] Well, smoothing/estimation but the difference doesn't matter for my point.
  
  2 replies →
- mrbungie 20 days ago
  
  > Anyone who still keeps implying these are statistical parrots or whatever is just going to regret these decisions in the future.
  You know this is a false dichotomy right? You can treat and consider LLMs statistical parrots and at the same time take advantage of them.
  
  1 reply →
- theshrike79 18 days ago
  
  There are musicians who "remix" (sample) other artists music and make massive hits themselves.
  Not every solution needs to be unique, in many cases "remixing" existing solutions in an unique way is better and faster.
- pseudosavant 20 days ago
  
  But all of my great ideas are purely from my own original inspiration, and not learning or pattern matching. Nothing derivative or remixed. /sarcasm
- heavyset_go 20 days ago
  
  Yeah, Yann LeCun is just some luddite lol
  
  6 replies →
ukuina 20 days ago

Single-idea implementations ("one-trick ponies") will die off, and composites that are harder to disassemble will be worth more.

vedmakk 20 days ago

After reading that post it feels so basic to sit here, watching my single humble claude code agent go along with its work... confident, but brittle and so easily distracted.

swordsith 19 days ago

It does feel like these multi-agent coding types are going to code themselves out of existence.

ramon156 19 days ago

I would also love to see the statistics regarding token cost, electricity cost, environmental damage etc.

Not saying that this only happens with LLMs, in fact it should be compared against e.g. a dev team of 4-5

cocoto 19 days ago
The complex thing is that you would need to take into account the energy used to feed the programmers, the energy used for their education or simply them growing up to the age they are working. For the LLMs it would have to take into account energy used for the GPU, the machine building the GPUs, datacenters, engineers maintaining it, their education etc etc. It’s so complex to really estimate these things from bottom up if you are not only looking locally, it feels impossible…
- marisen 19 days ago
  
  It is well known that a programmer that stops programming stops requiring food
  
  1 reply →
- AstroBen 19 days ago
  
  Yes. I absolutely agree. To fully optimize this system we must enact breeding restrictions to reduce the energy demands from the humans.
- xnx 19 days ago
  
  > It’s so complex to really estimate these things
  Is it? Use dollar cost of salary and cost for the AI. That wraps up all those things you mentioned.
xnx 19 days ago
Generally, if something costs less it has less environmental impact.
- bauerd 19 days ago
  
  Generally wrong. It may cost less because its externalities aren't priced in.
- oefrha 19 days ago
  
  If you exterminate the replaced human coders, sure.

Chipshuffle 19 days ago

The more I think about LLMs the stranger it feels trying to grasp what they are. To me, when I'm working with them, they don't feel intelligence but rather an attempt at mimicking it. You can never trust, that the AI actually did something smart or dump. The judge always has to be you.

It's ability to pattern match it's way through a code base is impressive until it's not and you always have to pull it back to reality when it goes astray.

It's ability to plan ahead is so limited and it's way of "remembering" is so basic. Every day it's a bit like 50 first dates.

Nonetheless seeing what can be achieved with this pseudo intelligence tool makes me feel a little in awe. It's the contrast between not being intelligence and achieving clearly useful outcomes if stirred correctly and the feeling that we just started to understand how to interact with this alien.

Gazoche 19 days ago
> they don't feel intelligence but rather an attempt at mimicking it
Because that's exactly what they are. An LLM is just a big optimization function with the objective "return the most probabilistically plausible sequence of words in a given context".
There is no higher thinking. They were literally built as a mimicry of intelligence.
- azan_ 19 days ago
  
  > Because that's exactly what they are. An LLM is just a big optimization function with the objective "return the most probabilistically plausible sequence of words in a given context". > There is no higher thinking. They were literally built as a mimicry of intelligence.
  Maybe real intelligence also is a big optimization function? Brain isn't magical, there are rules that govern our intelligence and I wouldn't be terribly surprised if our intelligence in fact turned out to be kind of returning the most plausible thoughs. Might as well be something else of course - my point is that "it's not intelligence, it's just predicting next token" doesn't make sense to me - it could be both!
- encyclopedism 19 days ago
  
  I don't understand why this point is NOT getting across to so many on HN.
  LLM's do not think, understand, reason, reflect, comprehend and they never shall. I have commented elsewhere but this bears repeating
  If you had enough paper and ink and the patience to go through it, you could take all the training data and manually step through and train the same model. Then once you have trained the model you could use even more pen and paper to step through the correct prompts to arrive at the answer. All of this would be a completely mechanical process. This really does bear thinking about. It's amazing the results that LLM's are able to acheive. But let's not kid ourselves and start throwing about terms like AGI or emergence just yet. It makes a mechanical process seem magical (as do computers in general).
  I should add it also makes sense as to why it would, just look at the volume of human knowledge (the training data). It's the training data with the mass quite literally of mankind's knowledge, genius, logic, inferences, language and intellect that does the heavy lifting.
  
  8 replies →
- clbrmbr 19 days ago
  
  Life is more fun as a scruffie.
  [0] http://www.catb.org/~esr/jargon/html/N/neats-vs--scruffies.h...
visarga 19 days ago

> The judge always has to be you.
But you can automate much of that work by having good tests. Why vibe-test AI code when you can code-test it? Spend your extra time thinking how to make testing even better.
cess11 19 days ago

It's a compressed database with diffuse indices. It's using probability matching rather than pattern matching. Write operations are called 'training' and 'fine-tuning'.
NiloCK 19 days ago
If you find yourself 50-first-dating your LLMs, it may be worth it to invest some energy into building up some better context indexing of both the codebase itself and of your roadmap.
- Chipshuffle 19 days ago
  
  Yeah, I admit I'm probably not doing that quite optimally. I'm still just letting the LLM generate ephemeral .md files that I delete after a certain task is done.
  The other day I found [beads](https://github.com/steveyegge/beads) and thought maybe that could be a good improvement over my current state.
  But I'm quite hesitant because I also have seen these AGENTS.md files become stale and then there is also the question of how much information is too much especially with the limited context windows.
  Probably all things that could again just be solved by leveraging AI more and I'm just an LLM noob. :D
  
  1 reply →

gforce_de 18 days ago

Wow, for screenshots much faster than chromium:

  $ time target/release/fetch_and_render "https://www.lauf-goethe-lauf.de/"
  real 0m0,685s
  user 0m0,548s
  sys 0m0,070s
  
  $ time chromium --headless --disable-gpu --screenshot=out.png --window-size=1200,800 https://www.lauf-goethe-lauf.de/
  real 0m1,099s
  user 0m0,927s
  sys 0m0,692s

# edit: with a hot-standby chrome and a running node instance a can reach 0,369s seconds here

tinyhouse 20 days ago

Well, software is measured over time. The devil is always in the details.

Deevian 19 days ago

Looking at the code it produced, the details literally look like something straight out of hell, so you're not far off.
aronowb14 20 days ago

Yeah curious what would happen if they asked for an additional big feature on top of the original spec

polyglotfacto 19 days ago

I'm a maintainer of Servo which is another web engine project.

Although I dissented on the decision, we banned the use of AI. Outside of the project I've been enjoying agentic coding and I do think it can be used already today to build production-grade software of browser-like complexity.

But this project shows that autonomous agents without human oversight is not the way forward.

Why? Because the generated code makes little sense from a conceptual perspective and does not provide a foundation on which to eventually build an entire web engine.

For example, I've just looked into the IndexedDB implementation, which happens to be what I am working on at the moment in Servo.

Now, my work in Servo is incomplete, but conceptually the code that is in place makes sense and there is a clear path towards eventually implementing the thing as a whole.

In Fastrender, you see an Arc<Mutex<Database>> which is never going to work, because by definition a production browser engine will have to involve multiple processes. That doesn't mean you need the IPC in a prototype, but you certainly should not have shared state--some simple messaging between threads or tasks would do.

The above is an easy coding fix for the AI, but it requires input from a human with a pretty good idea of what the architecture should look like.

For comparison, when I look at the code in Ladybird, yet another browser project, I can immediately find my way around what for me is a stranger codebase: not just a single file but across large swaths of the project and understand things like how their rendering loop works. With Fastrender I find it hard to find my way around, despite all the architectural diagrams in the README.

So what do I propose instead of long-running autonomous agents? The focus should shift towards demonstrating how AI can effectively assist humans in building well-architected software. The AI is great at coding, but you eventually run into what I call conceptual bottlenecks, which can be overcome with human oversight. I've written about this elsewhere: https://medium.com/@polyglot_factotum/on-writing-with-ai-87c...

There is one very good idea in the project: adding the web standards directly in the repo so it can be used as context by the AI and humans alike. Any project can apply this by adding specs and other artifacts right next to the code. I've been doing this myself with TLA+, see https://medium.com/@polyglot_factotum/tla-in-support-of-ai-c...

To further ground the AI code output, I suggest telling it to document the code with the corresponding lines from the spec.

Back in early 2025 when we had those discussions in Servo about whether to allow some use of AI, I wrote this guide https://gist.github.com/gterzian/26d07e24d7fc59f5c713ecff35d... which I think is also the kind of context you want to give the AI. Note that this was back in the days of accepting edits with tabs...

daxfohl 19 days ago
Though the fact that the code is so incoherent and inconsistent plausibly makes it more impressive that they still managed to make something that works at all, and weakens the argument that "all they did was copy/translate some existing other things to Rust."
That said, it's possible that none of that code even gets executed at run time, and the only code that is actually run is some translated glue code, with the other million lines essentially dead, so who knows.
- polyglotfacto 19 days ago
  
  I don't think it's all copy/pasted; it is quite an original byzantine architecture.
  You're right that lots of code appears only used in unit-tests, of which there is an enormous amount(making it hard to tell whether what is being tested makes sense). In Servo we don't have a single line of unit-tests in the script component, because all of it is covered by the WPT integration test suite shared with all other engines...
simonw 19 days ago
Thanks for this, that was a really informative comment.
- polyglotfacto 19 days ago
  
  You're welcome; big fan of your blog and a former Django dev myself.
  Just made some last edits above so not sure which version you saw. I toned it down a bit and clarified some stuff...

daxfohl 19 days ago

So we've graduated from unmaintainable slop code to unusable slop products. Sorry, this just doesn't feel like progress toward any meaningful future. But I'm sure it will unburden lots of investors of their money.

daxfohl 19 days ago

The whole industry is like one of those projects that claims "90% finished" from the time of the first demo, then for the next N years, all the way up until the project is eventually canceled. Except this project already has trillions of dollars at stake.

Agent_Builder 20 days ago

[flagged]

faeyanpiraat 19 days ago

please stop spamming about your tool

anilgulecha 20 days ago

That's a wild idea-a browser from scratch! And ladybird has been moving at snails pace for a long time..

I think a good abstractions design and good test suite will make it break success of future coding projects.

vivzkestrel 20 days ago

I am waiting for that guy or a team that uses LLMs to write the most optimal version of Windows in existence, something that even surpasses what Microsoft has done over the years and honestly looking at the current state of Windows 11, it really feels like it shouldn't even be that hard to make something more user friendly

kimixa 20 days ago
Considering Microsoft's significant (and vocal) investment in LLMs, I fear the current state of Windows 11 is related to a team trying to do exactly that.
- g947o 20 days ago
  
  I noticed that dialog that has worked correctly for the past 10+ years is using a new and apparently broken layout. Elements don't even align properly.
  It's hard to imagine a human developer misses something so obvious.
bandrami 19 days ago

The problem there is the same problem with AI-generated commercial feature films: the copyrightability of the output of LLMs remains an unexplored morass of legal questions, and no big name is going to put their name on something until that's adjudicated.