Comment by munksbeer
7 days ago
Does your team then manually decide the results by going over the PRs? I suppose you know what you're looking for now, but isn't this still quite painful?
7 days ago
We selected PRs (real ones we merged over the 6 months prior) and have an "LLM as judge" score how close the AI-generated code is to the merged PR. It's the same approach other benchmarks use, but with tasks we actually do and code we've already decided is up to scratch for us.
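For anyone wanting to picture the setup, here is a rough sketch of that kind of harness, not our actual code. The names `Task`, `run_benchmark`, and `placeholder_judge` are made up for illustration, and the judge here is a plain text-similarity stand-in so the example runs offline; in a real setup you would swap in a function that asks an LLM to grade the candidate against the merged PR.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable, Sequence


@dataclass
class Task:
    """One benchmark item built from a PR that was really merged (hypothetical shape)."""
    prompt: str       # task description handed to the AI tool
    merged_diff: str  # the code from the merged PR, used as the reference


# A judge takes (reference, candidate) and returns a score in [0, 1].
Judge = Callable[[str, str], float]


def placeholder_judge(reference: str, candidate: str) -> float:
    # ASSUMPTION: stand-in for the LLM judge. It only measures textual
    # similarity, which is far cruder than what an LLM grader would assess.
    return SequenceMatcher(None, reference, candidate).ratio()


def run_benchmark(
    tasks: Sequence[Task],
    generate: Callable[[str], str],
    judge: Judge,
) -> float:
    """Generate code for each task, score it against the merged PR, average the scores."""
    if not tasks:
        raise ValueError("need at least one task")
    scores = [judge(task.merged_diff, generate(task.prompt)) for task in tasks]
    return mean(scores)
```

The point of keeping the judge as a plain function argument is that the task set (your own merged PRs) stays fixed while you swap the model under test or the grader.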