Comment by afro88

7 days ago

We selected real PRs that we merged over the previous 6 months and have an "LLM as judge" score how close the AI-generated code is to each merged PR. It's the same approach other benchmarks use, but with tasks we actually do and code we've already decided is up to scratch for us.
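
For anyone wanting to try something similar, here is a rough sketch of how such a loop could be wired up. It is not our actual harness: `PRTask`, `run_benchmark`, and `summarize` are hypothetical names, the 0-to-1 score range is an assumption, and both the code generator and the judge are passed in as plain callables, so no particular LLM API is implied.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Iterable


@dataclass
class PRTask:
    """One benchmark task built from a merged PR (hypothetical structure)."""
    task_id: str
    prompt: str          # description of the change the AI is asked to make
    reference_diff: str  # the diff that was actually merged


# Assumed interfaces: a generator maps a prompt to a candidate diff, and a
# judge maps (reference_diff, candidate_diff) to a closeness score.
Generator = Callable[[str], str]
Judge = Callable[[str, str], float]


def run_benchmark(tasks: Iterable[PRTask],
                  generate: Generator,
                  judge: Judge) -> Dict[str, float]:
    """Score each AI-generated change against the merged PR it should match."""
    scores: Dict[str, float] = {}
    for task in tasks:
        candidate = generate(task.prompt)
        raw = judge(task.reference_diff, candidate)
        # Clamp to [0, 1] in case the judge returns something out of range.
        scores[task.task_id] = min(1.0, max(0.0, raw))
    return scores


def summarize(scores: Dict[str, float]) -> float:
    """Average score across tasks; 0.0 if there are no tasks."""
    return mean(scores.values()) if scores else 0.0
```

In practice `judge` would wrap a call to whichever model you use as the judge, with a prompt asking it to rate how close the candidate is to the merged diff. Keeping it as a callable means the loop can be tested with a stub first.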
