Comment by GodelNumbering

1 day ago

https://github.com/dirac-run/dirac#-evals

The README has an eval of 8 tasks across 7 agents (including both pi and omp). Pi-mono has the second-lowest cost across the 8 tasks (after Dirac) but occasionally produces incomplete changes.

Interestingly, the 2 tasks where pi missed some changes were both tasks that benefited from AST-level symbol understanding (e.g. find every reference to a given symbol and update each one). Since pi relies on bash-style tooling, it missed some occurrences.
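To illustrate why text search can miss occurrences, here is a minimal sketch using Python's standard `ast` module on a made-up Python snippet. The module `geometry`, the function `area`, and both helper functions are hypothetical; this is not how Dirac or pi work internally. The grep-style search matches a comment and a string literal but misses the aliased call `compute_area(...)`, while the AST walk follows `import ... as` and `from ... import ... as` bindings.

```python
import ast
import re

# Hypothetical source file: `geometry.area` is used under two aliases.
SOURCE = """\
from geometry import area as compute_area
import geometry as geo

def total(shapes):
    # area appears in this comment, but it is not a reference
    first = geo.area(shapes[0])
    rest = sum(compute_area(s) for s in shapes[1:])
    return first + rest

label = "area"
"""


def grep_lines(source: str, symbol: str) -> list[int]:
    """Grep-style search: lines where `symbol` appears as a whole word."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    return [n for n, line in enumerate(source.splitlines(), start=1)
            if pattern.search(line)]


def ast_reference_lines(source: str, module: str, symbol: str) -> list[int]:
    """AST-based search: lines that use `module.symbol` outside import
    statements, following `import module as m` and
    `from module import symbol as x`. Ignores shadowing, star imports, etc."""
    tree = ast.parse(source)
    module_aliases: set[str] = set()  # local names bound to `module`
    symbol_aliases: set[str] = set()  # local names bound to `module.symbol`
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for a in node.names:
                if a.name == module:
                    module_aliases.add(a.asname or a.name)
        elif isinstance(node, ast.ImportFrom) and node.module == module:
            for a in node.names:
                if a.name == symbol:
                    symbol_aliases.add(a.asname or a.name)

    lines: set[int] = set()
    for node in ast.walk(tree):
        # Bare name bound by `from module import symbol as x`
        if isinstance(node, ast.Name) and node.id in symbol_aliases:
            lines.add(node.lineno)
        # Attribute access like `m.symbol` where `m` is an alias of `module`
        elif (isinstance(node, ast.Attribute) and node.attr == symbol
              and isinstance(node.value, ast.Name)
              and node.value.id in module_aliases):
            lines.add(node.lineno)
    return sorted(lines)


print(grep_lines(SOURCE, "area"))                       # [1, 5, 6, 10]
print(ast_reference_lines(SOURCE, "geometry", "area"))  # [6, 7]
```

The grep result includes the import line, the comment, and the string literal, yet skips line 7; the AST result reports exactly the two real uses. A real refactoring agent would need more than this (scoping, shadowing, other languages), which is where tools like tree-sitter or a language server come in.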

I'm going to assume you didn't capture this data, but could you add the time to completion for each task if you have it?

Re: bash-type tooling, that doesn't mean an agent can't use the AST: with the tree-sitter CLI this should be perfectly possible.

  • I assume these benchmarks were done without any modifications to the default open-source harness. The tree-sitter CLI would be an extra plugin for pi-mono, but I'd be equally curious whether it would accomplish the task.