Comment by andrewshawcare

20 days ago

It used the best tests it could find for existing compilers. This is effectively steering Claude to a well-defined solution.

Hard to find fully specified problems like this in the wild.

I think this is more a testament to small, well-written tests than to agent teams. I imagine you could do the same thing with any frontier model and a single agent in a linear flow.

I don’t know why people use parallel agents and increase accidental complexity. Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?

> Write extremely high-quality tests

> Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

> For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.
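
The "new commits can't break existing code" rule in that quote could be sketched roughly as below. This is a minimal illustration, not the author's actual pipeline: the function names `regressions` and `gate_commit` and the name-to-pass/fail result format are assumptions made up for the example.

```python
from typing import Dict, Set


def regressions(baseline_passing: Set[str], current_results: Dict[str, bool]) -> Set[str]:
    """Return tests that passed on the baseline but fail (or are missing) now.

    Hypothetical helper: `current_results` maps test name -> True if it passed.
    """
    return {name for name in baseline_passing if not current_results.get(name, False)}


def gate_commit(baseline_passing: Set[str], current_results: Dict[str, bool]) -> bool:
    """Accept a commit only if nothing that previously passed is now broken.

    New tests that fail do not block the commit; only regressions do.
    """
    return not regressions(baseline_passing, current_results)
```

A gate like this only compares against tests that already passed, so a commit can add a failing test for an unimplemented feature without being rejected, but it cannot silently break something that used to work.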

6 comments

tantalor 20 days ago

Why didn't Claude realize on its own that it needed a continuous integration pipeline?

Far too much human intervention here.

sublimefire 20 days ago

> Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?

My thinking as well. IMO it is because with one agent you have to wait longer for results. You basically want to shorten the loops to improve the system. It hints that most of what we see is really the challenge of seeding a good context so that it can successfully do something over many iterations.

krzat 20 days ago

You know what else is well specified? LLM improving on itself.

widdershins 20 days ago

I wouldn't describe intelligence as well specified. We can't even agree on what it is.

GalaxyNova 20 days ago

> Hard to find fully specified problems like this in the wild.

This is such a big and obvious cope. This is obviously a very real problem in the wild and there are many, many others like it. Probably most problems are like this honestly or can be made to be like this.

anematode 20 days ago

Impressive, my sarcasm/bait detector almost failed me.