Comment by esafranchik

14 hours ago

Is the benchmark measuring one-shot retrieval accuracy, or coding agent response accuracy?

stephantul 14 hours ago

Hey! Co-author here. The benchmark currently only measures retrieval accuracy.

We’re interested in measuring it end to end, and in optimizing things like the prompt and tools for that, but we just haven’t gotten around to it.

esafranchik 13 hours ago
Two follow-ups:
1) How do you compare accuracy? By checking whether the answer appears in any of the returned grep/bm25/semble snippets?
2) How do you measure token use without the agent, prompt, and tools?
- stephantul 13 hours ago
  
  1) Yes! Though the metric is NDCG, not accuracy.
  2) We assume that once the agent gets the correct answer in the returned snippets, it does not need to read any further.
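To make the two answers above concrete, here is a rough Python sketch of how they could be computed. It is only an illustration under stated assumptions, not the benchmark's actual code: the function names `ndcg_at_k` and `tokens_until_hit` are hypothetical, relevance labels are assumed to be given per returned snippet, the ideal ranking is approximated by re-sorting the returned list (a full evaluation would use every relevant item in the corpus), and whitespace splitting stands in for a real tokenizer.

```python
import math
from typing import Callable, Sequence


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """NDCG@k for one query.

    relevances[i] is the relevance label of the i-th returned snippet.
    Assumption: the ideal ranking is the returned list re-sorted by label.
    """
    def dcg(rels: Sequence[float]) -> float:
        # Standard log2 position discount: rank 1 -> log2(2), rank 2 -> log2(3), ...
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


def tokens_until_hit(
    snippets: Sequence[str],
    is_relevant: Sequence[bool],
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> int:
    """Tokens read if the reader stops at the first relevant snippet.

    Assumption: snippets are read in ranked order, and everything is read
    when no snippet is relevant. The default counter is a crude stand-in.
    """
    total = 0
    for snippet, relevant in zip(snippets, is_relevant):
        total += count_tokens(snippet)
        if relevant:
            break
    return total
```

With this sketch, a relevant snippet at rank 1 scores 1.0, while the same snippet at rank 2 scores 1/log2(3), which is how NDCG rewards putting the answer earlier. The token estimate follows the same logic: the earlier the hit, the fewer snippets get counted.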
  