← Back to context

Comment by com2kid

15 hours ago

The advantage is being able to plug in new models to each piece of the pipeline.

Is it super sexy? No. But each individual type of model is developing at a different rate (TTS moves really fast, low latency STT/ASR moved slower, LLMs move at a pretty good pace).

You should probably split it up: an end-to-end model for great latency (especially for baked in turn taking), but under the hood it can call out to any old text based model to answer more intricate question. You just need to teach the speech model to stall for a bit, while the LLM is busy.

Just use the same tricks humans are using for that.