Comment by zelphirkalt

2 months ago

Can anyone tell me what the difficulty is in simply not having .git at all during a benchmark run? Why not remove anything that is not the code the benchmark runs on? Or was it just a simple oversight?

Coding agents are so powerful because they are not just looking at static code. Looking through git histories is a valid method for humans to solve certain kinds of bugs, so it makes sense that models should be able to do that too. And realistically, a lot of modern production code will have git information, so it's not like this wouldn't be a common real-world application.

  • That is a weak argument.

    The point is to benchmark against a human solving a problem. Typically these problems are posed as a question or a blank project, without that history.

    You are arguing for an apples-to-oranges comparison because the LLM performs better, rather than for a realistic comparison.

    • You apparently don't know what SWE-bench is [1]. First of all, it tries to evaluate skills that explicitly go beyond blank-project questions with given solutions. Secondly, it does not contain "optimal" or sometimes even correct solutions. That's because it uses real-world software development examples from actual PRs in popular repos. The humans who wrote these very likely used all the tools at their disposal as well (e.g. web search, git commands, code execution). Assuming an LLM could have solved these just by looking at a piece of code turns out to be very myopic.

      [1] https://arxiv.org/html/2310.06770v3

  • I think this issue is specifically about the agents looking at "future repository state" (according to the linked issue at least), so while looking at the history might be a normal method for solving issues, running `git log --all` to take a peek at the future which already includes the fix isn't very typical (yet?).

  • Well, there's legacy code and/or horrible git history that also needs fixing at some point. Also, I have witnessed how the history can send you down the wrong path. I don't agree that this is a good argument.