Comment by potatolicious

17 hours ago

Sure, go try it and evaluate it rigorously end-to-end, over a sufficient number and variety of tools.

For the purposes of the exercise, let's conservatively say, maybe ~2000 tools covering ~100 major verticals of use cases. Even that may be too narrow for a true general purpose assistant, but it's at least a good start. You can slice the sub-agents however you'd like.

If you can get recall, for real user utterances (not contrived eval utterances authored by your devs and MLEs), over 70% across all the verticals/use cases/tool uses, I'd be extremely impressed. Heck, my thoughts on this won't matter - if you can get the recall for such a system over the bar you'd have cracked something nobody else has and should actively try to sell it to Google for nine figures.
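To make "recall over 70% across all the verticals" concrete, here is a minimal sketch of how that bar could be checked. It assumes each eval row is a (vertical, expected tool, predicted tool) triple taken from real user utterances; the function names and the toy rows are hypothetical, not from any real eval harness.

```python
from collections import defaultdict

def recall_by_vertical(examples):
    """Per-vertical recall: fraction of utterances where the expected tool was picked.

    examples: iterable of (vertical, expected_tool, predicted_tool) triples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for vertical, expected, predicted in examples:
        totals[vertical] += 1
        if predicted == expected:
            hits[vertical] += 1
    return {v: hits[v] / totals[v] for v in totals}

def verticals_failing_bar(recalls, bar=0.70):
    """Verticals whose recall is not strictly over the bar."""
    return sorted(v for v, r in recalls.items() if r <= bar)

# Toy, made-up rows purely for illustration.
examples = [
    ("travel", "search_flights", "search_flights"),
    ("travel", "book_hotel", "search_flights"),
    ("finance", "get_balance", "get_balance"),
    ("finance", "transfer_funds", "transfer_funds"),
]

recalls = recall_by_vertical(examples)
print(recalls)
print(verticals_failing_bar(recalls))
```

The point of slicing by vertical is that a healthy aggregate number can hide verticals that fall well under the bar.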

rsanheim 8 hours ago

Yeah, it turns out many nerds don't consider the fact that the amazing tools we are using to do constrained tasks aren't that great for more general-purpose things. Writing a spike, spitting out unit tests, or vibe coding a front end feature is not the same as planning a trip to Europe, balancing accounts, or managing a schedule.

So much attention, effort, and tooling has focused on getting LLMs better at writing more and more code. They can grep and curl and run scripts and iterate and build things really fast, and maybe even maintain them if given enough guardrails and direction.

But it turns out we have had a _ton_ of useful training data for models to work with for software. Not just books or docs, but examples, tests, snippets and full programs for just about any language. Show me a Stack Overflow with Playwright scripts or API calls (hah, as if that's possible) to build itineraries from Delta, AA, United, Priceline, Expedia, etc., which is one part of one piece of the AI assistant pipe dream.

I don't think it's impossible that, as these tools get much smarter and more generally capable, we get decent assistants in other constrained, non-software domains, but it will take very good companies focusing on it for a long time. Much like any product that tries to do these sorts of things.

It's so easy for programmers in our bubble to overlook the complexity involved in automating or even _describing_ simple tasks that humans navigate every day via habit, learning, experience, and perception... all things that LLMs struggle with constantly.