
Comment by zbrock

4 hours ago

Hello! I’m one of the three engineers who wrote this piece. Happy to answer questions.

Interesting write up!

Have you been able to extract libraries or tools from this project yet? If so how was that experience?

That is, do you see yourself releasing a metric harness, or sub-projects equivalent to ActiveRecord, zod, or similar open-source tooling that often originates in a large in-house project and is then exported as a stand-alone tool, utility, library, or framework?

Because while AI can reimplement minor tools, its utility depends entirely on the existence of solid tools, libraries, and frameworks.

Fantastic job!

Can you share what type of project that was? Where does it fall on the spectrum from a database engine to a cat-picture-sharing website (very high demand for correctness vs. very lax)?

Very cool article!

- Are other teams adopting this approach? What are the blockers if not?

- Have there been problems where the models alone were not enough to debug and the devs had to fix things themselves?

- As the rate of change has increased with more devs, how have you dealt with merge conflicts from concurrent writers?

- If there were anything you could change in the approach you started with, what would it be?

  • 1. Yes! Many teams internally have adopted a lot of the same practices we outlined in the blog post. Ryan has also been spending time both internally and externally helping companies figure out how to do this in their code bases.

    2. Hmm, kind of. There have definitely been issues the models can’t one-shot. But we still use Codex to write all the actual code, with human guidance.

    3. More agents :) Some teams are experimenting with centralized, agent-mediated integration queues; others use normal merge queues; many have local Codex threads that monitor CI, resolve conflicts or failures, and land the changes.

    4. Today’s models and the Codex app. We started doing all this with gpt-5 and codex-cli. The tools today, 9 months later, are so much better than what we had then.

    • Have you built any tooling or products around all of this, or deployed it somehow? I’d love to learn more and share notes, because we’ve been doing this too: 3,100+ PRs merged across our 4-person team in 4 months. Impossible without harness engineering, and I agree, the tools are getting even better.

Have you been satisfied with the quality of code generated by the model? Or did you have to tweak some rule file or skill to improve it? Or is human-readable code not even a goal at this point?