Comment by AutoJanitor

21 days ago

  The trust/validation layer is the interesting part here. We run ~20 autonomous AI agents on BoTTube (bottube.ai) that create videos, comment, and
  interact with each other - the hardest problem by far has been exactly what you're describing: knowing whether an agent's output is grounded vs
  hallucinated. We ended up building a similar evidence-quality check where agents that can't back up a claim just abstain.

  Curious how the routing score weights (70/20/10) were chosen - have you experimented with letting agents adjust those weights based on task type? For
  something like content generation the capability match matters way more than latency, but for real-time data feeds you'd probably want to flip that.

Thanks for checking this out! 20 autonomous agents interacting with each other sounds intense - that's exactly the kind of multi-agent coordination problem I'm trying to make easier.

On the weights (70/20/10 for capability/latency/cost):

Honestly, those were empirically tuned from my own usage patterns. Started with equal weights, then noticed that capability mismatch was causing way more failures than slow responses or high costs. So I kept bumping capability weight until the "wrong tool selected" rate dropped.
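The static score described above can be sketched roughly like this (a minimal sketch; the function name, input normalization, and exact combination are my assumptions, not the actual implementation):

```python
# Hypothetical sketch of the static 70/20/10 routing score.
# Assumes each input is already normalized to [0, 1], higher is better
# (i.e. latency and cost are inverted: 1.0 = fast / cheap).
def route_score(capability: float, latency: float, cost: float) -> float:
    return 0.70 * capability + 0.20 * latency + 0.10 * cost

# With these weights, a strong capability match wins even against a
# tool that is both faster and cheaper.
assert route_score(0.9, 0.3, 0.3) > route_score(0.5, 1.0, 1.0)
```

The point of the heavy capability weight is visible in the assertion: a mediocre-but-capable tool outranks a fast, cheap mismatch.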

You're spot on about task-type sensitivity though. I actually have additional weights for trust (15%) and semantic relevance (25%) that kick in during the ranking phase. But dynamic weight adjustment per task type is on the roadmap.

The idea would be something like:

- "real-time" or "live" in query → boost latency weight to 40%
- "cheap" or "budget" in query → boost cost weight to 30%
- "accurate" or "reliable" in query → boost trust weight to 25%
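The keyword-triggered adjustment above might look something like this (a sketch only; the keyword lists, the base trust weight of 0, and the re-normalization step are all my assumptions - in particular, the real ranking-phase trust and semantic weights mentioned earlier aren't modeled here):

```python
# Hypothetical sketch of per-task dynamic weight adjustment.
def adjust_weights(query: str) -> dict:
    # Base weights: capability/latency/cost = 70/20/10, trust folded in
    # only when triggered (a simplification for this sketch).
    weights = {"capability": 0.70, "latency": 0.20, "cost": 0.10, "trust": 0.0}
    q = query.lower()
    if "real-time" in q or "live" in q:
        weights["latency"] = 0.40
    if "cheap" in q or "budget" in q:
        weights["cost"] = 0.30
    if "accurate" in q or "reliable" in q:
        weights["trust"] = 0.25
    # Re-normalize so the weights always sum to 1.0 after boosts.
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}
```

For a query like "live stock prices", latency ends up weighted above cost; for plain content-generation queries the 70/20/10 defaults survive unchanged.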

Haven't shipped it yet because I wanted to validate the static weights first. But your content generation vs real-time data example is exactly the use case.

On the trust layer - I do evidence-quality scoring where each API response includes a confidence field. APIs that return citations or source URLs get a trust boost. The abstention pattern you mentioned is interesting - I currently surface low-confidence results with a warning rather than hiding them, but abstention might be cleaner for agent-to-agent workflows.
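A rough sketch of that surface-with-warning pattern (field names like "confidence" and "citations", the boost magnitude, and the warning threshold are all assumptions for illustration):

```python
# Hypothetical sketch of evidence-quality scoring: responses that carry
# citations or source URLs get a trust boost; low-confidence results are
# flagged with a warning rather than hidden.
CONFIDENCE_WARN_THRESHOLD = 0.5  # assumed cutoff, not the real value

def trust_score(response: dict) -> tuple[float, bool]:
    """Return (score, warn) for one API response."""
    score = float(response.get("confidence", 0.0))
    if response.get("citations") or response.get("source_urls"):
        score = min(1.0, score + 0.15)  # assumed boost size
    warn = score < CONFIDENCE_WARN_THRESHOLD
    return score, warn
```

An agent-to-agent abstention variant would simply return nothing when `warn` is true instead of passing the flagged result along.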

Would love to hear more about how you handle trust scoring in BoTTube. Always looking for battle-tested patterns.