Comment by onlyrealcuzzo

22 days ago

I'm seeing over and over again people claiming absurd optimizations for coding agents:

> Our tool uses 99x fewer tokens and delivers 88x better results.

Okay, great, but...

1) It's VERY difficult to quantify that something is better.

2) They almost never post how they measured how much better it is and what the margin of error might be.

3) I assume they are incompetent and don't even try the tool.

Like you pointed out, the odds that these things make agents worse are FAR higher than the odds that they make them better.

Not saying it's impossible, but if it were possible on the scales they are claiming, it probably would already have been done, or put into the next release of the agents...

Hey, this skepticism is fair and we share it, which is why we don't claim end-to-end agent improvements: we haven't measured those (yet). The benchmark we published measures retrieval quality and token count during search, not overall agent performance. We are working on agent-level evals, but those are unfortunately much harder to get right. However, based on our own experience of using Semble for the past months during development, we do believe it makes agents better (or at the very least, cheaper).

  • > We are working on agent-level evals, but those are unfortunately much harder to get right.

    It's unfortunately a nearly impossible task, as the models change regularly (without letting you know), so you have a moving (invisible) target that's 1) hard to test exhaustively, and 2) very expensive to test with a low margin of error.

    This is why no one does it, and people just make broad, sweeping, unverified claims instead.

    If you figure out how to do it... You should probably just get a job at Anthropic or OpenAI and make $2M+ per year...
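
To put rough numbers on "very expensive to test with a low margin of error", here is a minimal sketch. It assumes each agent run is an independent pass/fail trial and uses the standard normal approximation for a binomial proportion. The function name `runs_needed` and the default values are illustrative assumptions, not anything measured in this thread.

```python
import math


def runs_needed(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Estimate how many pass/fail runs give a +/- `margin` interval.

    margin: desired half-width of the interval on the pass rate (0.05 = 5 points)
    p:      assumed true pass rate; 0.5 is the worst case (largest variance)
    z:      z-score for the confidence level (1.96 is roughly 95%)
    """
    # Normal approximation: margin = z * sqrt(p * (1 - p) / n), solved for n.
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


if __name__ == "__main__":
    for margin in (0.10, 0.05, 0.02, 0.01):
        print(f"+/- {margin:.0%}: about {runs_needed(margin)} runs per configuration")
```

Under these assumptions you need roughly 385 runs to pin a pass rate down to within 5 points, and on the order of 9,600 runs for 1 point. That is per tool configuration, and it has to be repeated whenever the underlying model changes.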