Comment by Bibabomas

22 days ago

Hey, this skepticism is fair and we share it, which is why we don't claim end-to-end agent improvements: we haven't measured those (yet). The benchmark we published measures retrieval quality and token count during search, not overall agent performance. We are working on agent-level evals, but those are unfortunately much harder to get right. However, based on our own experience of using Semble over the past months of development, we do believe it makes agents better (or at the very least, cheaper).

> We are working on agent-level evals, but those are unfortunately much harder to get right.

It's unfortunately a nearly impossible task, as the models change regularly (without letting you know), so you have a moving (invisible) target that's 1) hard to test exhaustively, and 2) very expensive to test with a low margin of error.
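
To put rough numbers on the "low margin of error" point, here is a minimal sketch using the standard normal-approximation sample-size formula for a pass rate. The function name `runs_needed` and the 50% worst-case pass rate are assumptions for illustration, not anything taken from Semble's benchmark.

```python
import math

def runs_needed(margin: float, pass_rate: float = 0.5, z: float = 1.96) -> int:
    """Agent runs needed to estimate a pass rate within +/- margin.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2,
    with z = 1.96 for roughly 95% confidence. pass_rate = 0.5 is the
    worst case (largest variance) and is an assumption for illustration.
    """
    return math.ceil(z * z * pass_rate * (1.0 - pass_rate) / (margin * margin))

# Hypothetical margins; each run is one full agent task attempt.
for margin in (0.05, 0.02, 0.01):
    print(f"+/-{margin:.0%}: {runs_needed(margin)} runs per configuration")
```

Under these assumptions a ±5% margin already takes about 385 runs per configuration, and halving the margin roughly quadruples that. Every silent model update means paying for the whole batch again.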

This is why no one does it, and everyone just makes broad, sweeping, unverified claims instead.

If you figure out how to do it... You should probably just get a job at Anthropic or OpenAI and make $2M+ per year...