Comment by tomtom1337

20 days ago

Any suggestions for «orchestrating» this type of experiment?

And how does one compare the results in a way that is easy to parse? Seven models producing one PR each is one approach, but seven separate PRs aren't easy to compare side by side.

https://github.com/voratiq/voratiq

For comparison, there's a `review` command that launches a sandboxed agent to review a given run and rank the various implementations. We usually run 1–3 review agents, pull the top 3 diffs, and do manual review from there.
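The aggregation step could work something like the following sketch. This is not the voratiq implementation or API, just an illustration of one way to merge rankings from several review agents: a Borda count, where each agent's ranking awards points by position and the top-scoring diffs are pulled for manual review. All names and data below are hypothetical.

```python
# Hypothetical sketch of merging rankings from multiple review agents.
# Not the voratiq API; candidate names are illustrative.
from collections import defaultdict

def top_candidates(rankings: list[list[str]], k: int = 3) -> list[str]:
    """Borda-count aggregation: in a ranking of n candidates, the
    first pick earns n-1 points, the second n-2, and so on.
    Ties break alphabetically for determinism."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]

# Three review agents each rank seven model-produced diffs:
agent_rankings = [
    ["gpt", "claude", "gemini", "llama", "mistral", "qwen", "deepseek"],
    ["claude", "gpt", "llama", "gemini", "qwen", "deepseek", "mistral"],
    ["gpt", "llama", "claude", "mistral", "gemini", "deepseek", "qwen"],
]
print(top_candidates(agent_rankings))  # → ['gpt', 'claude', 'llama']
```

A rank aggregation like this is more robust than trusting a single review agent, since any one agent's idiosyncratic preferences get averaged out across the panel.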

We're working on better automation for this step right now.