Comment by amelius
17 hours ago
Without benchmarks and/or a whole suite of non-cherry-picked examples, this means nothing, because you can trivially make an AI generate anything from text.
Working on benchmarks at the moment! Always open to feedback / PRs.
I'm definitely working on benchmarks for how my own general harness improves task performance vs. the same model in a commodity setup. It's hard to do!
I will say that my current harness, https://github.com/cartazio/oh-punkin-pi, is a testbed for a bunch of 2nd-gen harness tech, largely optimized for reasoning LLMs only. The next one after this harness is gonna be epic.