Comment by munksbeer

7 days ago

Does your team then manually decide the results by going over the PRs? I suppose you know what you're looking for now, but isn't this still quite painful?

afro88 7 days ago

We selected PRs (real ones we merged over the 6 months prior) and have an "LLM as judge" score how close the AI-generated code is to the PR. It's the same approach other benchmarks use, but with tasks we actually do and code we've decided is up to scratch for us.
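
For anyone curious what that loop might look like, here is a minimal sketch, not the actual harness described above. Everything in it is an assumption made for illustration: the `Task` shape, the 0-10 scale, the prompt wording, the `SCORE:` reply format, and the `generate` and `judge` callables, which stand in for whatever coding agent and judge model you would really call.

```python
import re
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable


@dataclass
class Task:
    """One benchmark task built from a PR that was really merged (hypothetical shape)."""
    description: str   # what the PR was meant to do
    merged_diff: str   # the diff the team accepted, used as the reference


# Hypothetical prompt wording; the 0-10 scale is an assumption, not from the thread.
PROMPT_TEMPLATE = (
    "You are reviewing an AI-generated change against the change a team merged.\n"
    "Task: {description}\n\n"
    "Merged diff:\n{merged}\n\n"
    "AI-generated diff:\n{candidate}\n\n"
    "Rate from 0 to 10 how close the AI-generated diff is to the merged one.\n"
    "End your reply with a line of the form 'SCORE: <number>'."
)


def build_prompt(task: Task, candidate_diff: str) -> str:
    """Fill the judge prompt with the reference diff and the candidate diff."""
    return PROMPT_TEMPLATE.format(
        description=task.description,
        merged=task.merged_diff,
        candidate=candidate_diff,
    )


def parse_score(reply: str) -> int:
    """Pull the 'SCORE: n' value out of the judge's reply and clamp it to 0..10."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        raise ValueError("judge reply contained no 'SCORE: <number>' line")
    return max(0, min(10, int(match.group(1))))


def score_task(task: Task, candidate_diff: str, judge: Callable[[str], str]) -> int:
    """Ask the judge (any callable mapping prompt text to reply text) for a score."""
    return parse_score(judge(build_prompt(task, candidate_diff)))


def run_benchmark(
    tasks: Iterable[Task],
    generate: Callable[[Task], str],
    judge: Callable[[str], str],
) -> float:
    """Generate a candidate diff for each task and average the judge's scores."""
    return mean(score_task(task, generate(task), judge) for task in tasks)
```

In practice `generate` would wrap the coding tool under test and `judge` would wrap a call to the judge model. Keeping both as plain callables means the scoring logic can be exercised offline with stubs before any model is involved.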