
Comment by potatolicious

12 hours ago

> "They have all data but can’t seem to get an llm that can set an alarm and be a chatbot at the same time?"

This is actually a frontier problem. The "general purpose" assistant is one of the hardest open technical problems with LLMs (or any kind of NLP).

I think people are so easily snowed by LLMs' apparent linguistic fluency that they impute that fluency to capability. This could not be further from the truth.

In reality, an LLM presented with a vast array of tools has extremely poor reliability. So if you want a thing that can order delivery and remember your shopping list and remind you of your flight and play music, you're radically exceeding the capabilities of current models. There's a reason the successful uses of agentic LLMs (anything that isn't demoware/vaporware) tend toward narrow-domain use cases.

There's a reason Google hasn't done it either, and indeed nor has anyone else: neither Anthropic nor OpenAI has a general purpose assistant (defined as being able to execute an indefinite number of arbitrary tools to do things for you, as opposed to merely conversing with you).

You split the tasks up into sub-agents. This is something my company builds on top of LangGraph.
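The sub-agent pattern can be sketched in a few lines, a router classifies each utterance into a domain, then a narrow sub-agent with only that domain's tools handles it. This is plain Python, not LangGraph's actual API; the domains, tool handlers, and keyword router are all illustrative stand-ins (in practice the router would itself be an LLM call over a small fixed label set):

```python
# Hypothetical sketch of the sub-agent pattern. Each sub-agent sees
# only its own domain's tools, which is exactly what keeps per-domain
# reliability high and the "general purpose" case hard.

def music_agent(utterance: str) -> str:
    # Sub-agent exposed only to music tools (play, pause, queue, ...)
    return f"[music] handling: {utterance}"

def alarms_agent(utterance: str) -> str:
    # Sub-agent exposed only to alarm/timer tools
    return f"[alarms] handling: {utterance}"

SUB_AGENTS = {"music": music_agent, "alarms": alarms_agent}

def route(utterance: str) -> str:
    # Stand-in router: a keyword check where a real system would make
    # a classification call against a small, fixed set of domain labels.
    domain = "alarms" if "alarm" in utterance.lower() else "music"
    return SUB_AGENTS[domain](utterance)

print(route("Set an alarm for 7am"))  # → [alarms] handling: Set an alarm for 7am
```

The hard part the comment below gets at is not this routing code, it's keeping the router's accuracy acceptable once the label set grows to ~100 verticals and ~2000 tools.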

  • Sure, go try it and evaluate it rigorously end-to-end, over a sufficient number and variety of tools.

    For the purposes of the exercise, let's conservatively say ~2000 tools covering ~100 major verticals of use cases. Even that may be too narrow for a true general purpose assistant, but it's at least a good start. You can slice the sub-agents however you'd like.

    If you can get recall, on real user utterances (not contrived eval utterances authored by your devs and MLEs), over 70% across all the verticals/use cases/tool uses, I'd be extremely impressed. Heck, my thoughts on this won't matter - if you can get the recall of such a system over that bar, you'll have cracked something nobody else has and should actively try to sell it to Google for nine figures.
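The evaluation being proposed is straightforward to state in code, even though passing it is not. A minimal sketch of computing per-vertical and overall recall for tool selection, assuming an eval set of (vertical, expected tool, predicted tool) triples collected from real utterances; the field names and example data are hypothetical:

```python
from collections import defaultdict

def recall_by_vertical(examples):
    """Per-vertical and overall recall for tool selection.

    `examples` is a list of (vertical, expected_tool, predicted_tool)
    triples. Any eval harness with gold tool labels fits this shape.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for vertical, expected, predicted in examples:
        totals[vertical] += 1
        if predicted == expected:
            hits[vertical] += 1
    per_vertical = {v: hits[v] / totals[v] for v in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_vertical, overall

# Toy eval set; real utterances and ~100 verticals are the actual test.
examples = [
    ("alarms", "set_alarm", "set_alarm"),
    ("alarms", "set_alarm", "play_music"),   # miss: wrong tool chosen
    ("food",   "order_delivery", "order_delivery"),
]
per_v, overall = recall_by_vertical(examples)
print(per_v)    # → {'alarms': 0.5, 'food': 1.0}
print(overall)  # → 0.6666666666666666
```

The "over 70% across all verticals" bar means every value in `per_v` clears the threshold, not just the aggregate, which is what makes the wide-domain case so punishing.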