Sure, go try it and evaluate it rigorously end-to-end, over a sufficient number and variety of tools.
For the purposes of the exercise, let's conservatively say ~2000 tools covering ~100 major verticals of use cases. Even that may be too narrow for a true general-purpose assistant, but it's at least a good start. You can slice the sub-agents however you'd like.
If you can get recall on real user utterances (not contrived eval utterances authored by your devs and MLEs) over 70% across all the verticals/use cases/tool uses, I'd be extremely impressed. Heck, my thoughts on this won't matter - if you can get the recall for such a system over that bar, you'd have cracked something nobody else has and should actively try to sell it to Google for nine figures.
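To make the bar concrete, here's a rough sketch of how I'd score it. Everything in it is an assumption for illustration: the record layout, the function names, the toy tool names, and especially the reading that "recall" means "the expected tool shows up among the tools the system actually invoked" and that every vertical has to clear 70% on its own rather than just the pooled number.

```python
from collections import defaultdict

def recall_by_vertical(records):
    """Per-vertical recall over (vertical, expected_tool, invoked_tools) records.

    A record counts as a hit when the expected tool appears among the
    tools the system actually invoked for that utterance (assumed definition).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for vertical, expected_tool, invoked_tools in records:
        totals[vertical] += 1
        if expected_tool in invoked_tools:
            hits[vertical] += 1
    return {v: hits[v] / totals[v] for v in totals}

def pooled_recall(records):
    """Recall pooled over all utterances, ignoring vertical boundaries."""
    records = list(records)
    hit_count = sum(1 for _, expected, invoked in records if expected in invoked)
    return hit_count / len(records)

def clears_bar(per_vertical, bar=0.70):
    """True only if every vertical is strictly above the bar."""
    return all(r > bar for r in per_vertical.values())

# Hypothetical labeled utterances; real ones would come from production logs,
# not from the team that built the system.
records = [
    ("travel", "book_flight", ["book_flight"]),
    ("travel", "book_hotel", ["search_web"]),
    ("music", "play_song", ["play_song"]),
    ("music", "play_song", []),
    ("music", "set_volume", ["set_volume"]),
]

per_vertical = recall_by_vertical(records)
overall = pooled_recall(records)
ok = clears_bar(per_vertical)
```

Reporting per-vertical numbers next to the pooled one matters here because a handful of high-traffic verticals can carry the pooled figure over 70% while most of the long tail sits well below it.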