Comment by vessenes

1 year ago

This is so, so good. I like that it seems to be a teaser app for Cerebrium, if I understand it. It has good killer-app potential. My tests from an iPad ranged from 1400ms to 400ms reported latency; at the low end, it felt very fluid.

One thing this speed makes me think is that for some chat workflows you’ll need (or get) to have a kind of multi-step approach: essentially, a quick response, during which a longer data/info/RAG query can be farmed out, and then the informative result picks up.

Humans work like this; we use lots of filler words as we sort of get going responding to things.

Right now, most workflows seem to be just one-shot prompting, or, in the background, parse -> query -> generate. The better workflow once you have low-latency responses is probably something like: [3s of Llama 8B in your ears] -> query -> [55s of Llama 70B/GPT-4/whatever you want, informed by query].
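
A minimal sketch of that two-stage flow, just to make it concrete. The `quick_model`, `retrieve`, and `big_model` helpers here are hypothetical stand-ins for whatever small LLM, RAG backend, and large LLM you actually wire up; the sleeps fake the latencies so the example runs on its own:

```python
# Hypothetical sketch of the "quick ack, then informed answer" flow described above.
import asyncio


async def quick_model(user_msg: str) -> str:
    # Small, low-latency model (e.g. an 8B Llama) produces the filler/ack.
    await asyncio.sleep(0.3)   # pretend ~300 ms inference
    return "Sure, give me a second while I look that up."


async def retrieve(user_msg: str) -> str:
    # RAG / data query that takes longer than the ack.
    await asyncio.sleep(2.0)   # pretend ~2 s round trip
    return "relevant documents / query results"


async def big_model(user_msg: str, context: str) -> str:
    # Larger model answers, informed by the retrieved context.
    await asyncio.sleep(3.0)
    return f"Here is what I found about '{user_msg}', based on: {context}"


async def respond(user_msg: str) -> None:
    # Kick off retrieval immediately, speak the ack while it runs,
    # then hand the retrieved context to the big model.
    retrieval = asyncio.create_task(retrieve(user_msg))
    print(await quick_model(user_msg))   # in the user's ears right away
    context = await retrieval
    print(await big_model(user_msg, context))


asyncio.run(respond("what's our churn rate this quarter?"))
```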

Very cool, thank you for sharing this.

Hi Vessenes

From Cerebrium here. Really appreciate the feedback - glad you had a good experience!

This application is easy to extend/implement, meaning you can edit it however you like:

- Swap in different LLMs, STT and TTS models
- Change prompts, as well as implement RAG, etc.

In partnership with Daily, we really wanted to focus on the engineer here: make it extremely flexible for them to edit the application to suit their use case/preference, while at the same time taking away the mundane infrastructure setup.

You can read more about how to extend it here: https://docs.cerebrium.ai/v4/examples/realtime-voice-agents

  • Thanks for this reply. Yep, as an engineer, this is awesome, the docs look simple and I’ll give it a whirl. As a product guy, it seems like it would be dead simple to start a company on this tech by just putting up a web page that lets people pick a couple choices and gives them a custom domain. Very cool!

I've wondered about this as well. Is there a way to have a small, efficient LLM that can estimate general task complexity without actually running the full task workload?

Scoring complexity on a gradient would let you know when you need to send a "Sure, one second, let me look that up for you" instead of waiting out a long round trip.

  • For sure: in fact, MoE models train such a router directly, and the routers are not super large. But it would also be easy to run Phi-3 against a request (rough sketch below).

    I almost think you could do a check-my-work style response: ‘I’m pretty sure xx ... wait, actually y.’ Or if you were right, ‘yep that’s correct. I just checked.’

    There’s time in there to do the check and to get the large model to bridge the first sentence with the final response.
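
A rough sketch of that routing idea, assuming you prompt a small model to hand back a complexity score before deciding whether to send the filler line. `classify_complexity` below is a hypothetical stand-in for the small-model call (Phi-3, an MoE router head, etc.); here it is just a keyword heuristic so the example runs on its own:

```python
# Hypothetical complexity-routing sketch: a cheap first pass decides whether the
# request needs a filler acknowledgement plus a slow tool/RAG round trip,
# or can be answered directly by the fast model.
def classify_complexity(user_msg: str) -> float:
    # Stand-in for a small model prompted to return a 0..1 "how much work is this?"
    # score. Replaced here by a crude keyword heuristic.
    heavy_markers = ("look up", "latest", "compare", "how many", "report")
    return 0.9 if any(m in user_msg.lower() for m in heavy_markers) else 0.2


COMPLEXITY_THRESHOLD = 0.5


def handle(user_msg: str) -> None:
    score = classify_complexity(user_msg)
    if score >= COMPLEXITY_THRESHOLD:
        # Filler plays while the long retrieval + large-model job runs.
        print("Sure, one second, let me look that up for you.")
        # ... kick off retrieval and the large model here ...
    else:
        print("(answer directly with the fast model)")


handle("Can you compare our Q1 and Q2 numbers?")
handle("Hi there!")
```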