Comment by esperent

6 hours ago

I did some evals with pi and GPT 5.5. I tested RTK on / headroom on / both on / both off (all with the standard pi system instructions and no AGENTS.md).

I forget the exact tests I used (a couple of the standard agent evals that people use, one Python and one TypeScript because those are what I use).

I don't claim it was an exhaustive test, or even a good one. It's possible I could have spent a day or so tuning my AGENTS.md and the pi system prompt/tool instructions and gotten better results, because if there's one thing running evals taught me, it's that subtle differences there can change the results a lot.

However, I got clearly better results with both off, enough to convince me to stop the tests immediately after 3 rounds.

The problem was that while context use did go down (sometimes), the number of turns to complete went up, so the overall cost of the conversation was higher.
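
To make that arithmetic concrete, here's a rough sketch. It assumes every turn re-sends the whole accumulated context as input, ignores prompt caching and output tokens, and uses made-up placeholder numbers (not my eval data). `conversation_cost` and its parameters are just names I picked for illustration.

```python
def conversation_cost(turns, tokens_added_per_turn, base_context=2000,
                      price_per_input_token=1):
    """Total input cost when each turn re-sends the full context so far.

    Assumption: no prompt caching, output tokens ignored, and the price is
    an arbitrary unit cost per input token.
    """
    total = 0
    context = base_context
    for _ in range(turns):
        # Each turn appends new tool output / messages to the context...
        context += tokens_added_per_turn
        # ...and then pays for the whole context again as input.
        total += context * price_per_input_token
    return total


# Hypothetical numbers: the "compressed" run adds 40% fewer tokens per turn
# but needs 16 turns instead of 10 to finish the task.
baseline = conversation_cost(turns=10, tokens_added_per_turn=1000)
compressed = conversation_cost(turns=16, tokens_added_per_turn=600)

print(baseline, compressed)
```

With these placeholder numbers the compressed run ends with a smaller final context but a higher total, because each extra turn pays for the whole context again.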

It's made me very aware of one thing: so many people are sharing these kinds of tools, but either with zero evals (or evals that are suspiciously hard to reproduce) or, in the case of this one, with extensive benchmarks testing the wrong thing.

I'm sure this tool does use fewer tokens than grep, and the benchmarks prove it, but that's not what matters here. What matters is, does an agent using it get the same quality of work done more quickly and for lower cost?
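
As a sketch of what I mean by comparing on the things that matter, something like this would do. The `RunResult` fields and the `summarize` helper are hypothetical names, not part of any eval framework, and the sample rows are placeholders rather than real results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    config: str      # hypothetical label, e.g. "both off" or "tool on"
    passed: bool     # did the agent complete the task correctly?
    turns: int       # number of turns the agent needed
    cost_usd: float  # total cost of the conversation


def summarize(results):
    """Group runs by configuration and report quality, speed, and cost."""
    by_config = {}
    for r in results:
        by_config.setdefault(r.config, []).append(r)

    summary = {}
    for config, runs in by_config.items():
        summary[config] = {
            "pass_rate": mean([1.0 if r.passed else 0.0 for r in runs]),
            "mean_turns": mean([r.turns for r in runs]),
            "mean_cost_usd": mean([r.cost_usd for r in runs]),
        }
    return summary


# Placeholder rows only, to show the shape of the comparison.
example = [
    RunResult("both off", True, 10, 0.5),
    RunResult("both off", False, 12, 1.5),
    RunResult("tool on", True, 16, 2.0),
]
report = summarize(example)
```

The point of the design is that token count per tool call never appears: a configuration only wins if pass rate holds up while turns and total cost go down.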

There's an industry-wide deficit of tests for AI right now. It's not just this tool, it's everything you add to your code base or your development flow that uses AI. Nobody had tests for "how fast/well was this developed" before AI and they haven't added them now.

With AI, "they could, so they never wondered if they should" will be a very frequent thing.

  • This is a bit rude.

    We didn't generate this project, we wrote it, a lot of it manually, and trained custom models. We'd been working in the real-time retrieval space for a while, and we thought coding was a good fit for this specific technology.

    • My comment above wasn't meant to be rude. And you do have extensive benchmarks against grep etc., so it's clear you understand the importance of that.

      But I still think you're missing the harder but more important proof, which is agent evals. Have you done any of that?

      I would personally love to find tools in this space that can make agents more efficient, and I do believe there's scope for massive improvements compared to default workflows. But my evals with RTK and Headroom have made me wary that a tool can look like it should work, conceptually make sense, pass non-agentic benchmarks, and still make an actual agentic workflow worse.

  • Yeah, I think I'm prone to doing the same. It is so easy to create, and we get too excited by it instead of first doing the necessary research, which is much more boring than actually producing something.