Comment by bufferoverflow
5 months ago
And how do you evaluate whether the task was completed correctly? There are nearly infinite ways to solve any non-trivial software dev problem (and I hope they are not benchmarking trivial problems).
The paper says they created end-to-end (e2e) tests to check whether the task was completed successfully.
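To illustrate why that sidesteps the "infinite solutions" problem, here is a rough sketch, not taken from the paper: an e2e-style check only looks at observable behavior, so any implementation that behaves correctly passes. The task (order-preserving dedupe), the function names, and the test cases are all hypothetical and chosen just for the example.

```python
from typing import Callable, List

# Two hypothetical solutions to the same task:
# "remove duplicates from a list while preserving first-seen order".

def dedupe_with_set(items: List[str]) -> List[str]:
    # Explicit loop that tracks already-seen items in a set.
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

def dedupe_with_dict(items: List[str]) -> List[str]:
    # Relies on dicts preserving insertion order (Python 3.7+).
    return list(dict.fromkeys(items))

def e2e_check(solution: Callable[[List[str]], List[str]]) -> bool:
    # Behavior-only check: it never inspects how the solution is written,
    # only whether the outputs match the expected results.
    cases = [
        (["a", "b", "a", "c", "b"], ["a", "b", "c"]),
        ([], []),
        (["x", "x", "x"], ["x"]),
    ]
    return all(solution(list(inp)) == expected for inp, expected in cases)
```

Both `dedupe_with_set` and `dedupe_with_dict` pass `e2e_check` even though the code is different, while a solution that does nothing (returns its input unchanged) fails. The obvious caveat is that a check like this is only as good as its cases.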