Comment by totetsu
12 hours ago
What is the difference between Flux’s end-of-turn detection and OpenAI’s automatic turn detection in semantic mode?
12 hours ago
> What is the difference between Flux’s end-of-turn detection and OpenAI’s automatic turn detection in semantic mode?
In OpenAI's own words about semantic_vad:
> Chunks the audio when the model believes based on the words said by the user that they have completed their utterance.
Source: https://developers.openai.com/api/docs/guides/realtime-vad
OpenAI's semantic mode looks at the meaning of the transcribed text to make an educated guess about where the user's utterance ends.
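For context, semantic VAD is something you opt into per Realtime session. Below is a hedged sketch of the `session.update` payload, with field names following OpenAI's published Realtime docs; the exact `eagerness` values shown are my recollection of the docs, so double-check before relying on them:

```python
import json

# Sketch of a Realtime API session.update event that switches turn
# detection from the default server-side VAD to semantic VAD.
# "eagerness" controls how quickly the model commits to an end of turn.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",   # vs. the default "server_vad"
            "eagerness": "medium",    # assumed values: low / medium / high / auto
        }
    },
}

print(json.dumps(session_update, indent=2))
```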
According to Deepgram, Flux's end-of-turn detection is not just a semantic VAD (which is inherently a separate model from the STT model doing the transcription). Deepgram describes Flux as:
> the same model that produces transcripts is also responsible for modeling conversational flow and turn detection.
[...]
> With complete semantic, acoustic, and full-turn context in a fused model, Flux is able to very accurately detect turn ends and avoid the premature interruptions common with traditional approaches.
Source: https://deepgram.com/learn/introducing-flux-conversational-s...
So according to them, end-of-turn detection isn't based only on the semantic content of the transcript (which makes sense given the latency), but also on characteristics of the actual audio waveform itself.
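The architectural difference can be sketched roughly like this. Everything here is illustrative, not Deepgram's or OpenAI's actual API: `FrameResult`, `two_stage_eot`, and `fused_eot` are hypothetical names, and the model arguments are stand-ins for real inference calls:

```python
from dataclasses import dataclass

@dataclass
class FrameResult:
    transcript_delta: str   # newly transcribed words, if any
    eot_probability: float  # confidence that the user's turn has ended

def two_stage_eot(audio_frame, stt_model, vad_model) -> FrameResult:
    """Semantic-VAD style: a separate classifier reads only the STT
    output, so the end-of-turn decision sees text, not audio."""
    words = stt_model(audio_frame)
    return FrameResult(words, vad_model(words))

def fused_eot(audio_frame, fused_model) -> FrameResult:
    """Flux-style (per Deepgram's description): one model consumes the
    raw audio and jointly emits the transcript and an end-of-turn
    signal, so acoustic cues like pitch and pacing inform the decision
    directly, without a second model in the loop."""
    words, p_eot = fused_model(audio_frame)
    return FrameResult(words, p_eot)
```

The fused version also explains the latency point above: there is no hand-off of a finished transcript to a downstream classifier before the turn decision can be made.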
Pipecat (an open-source voice AI orchestration platform) seemingly does something similar with its smart-turn native turn-detection model, minus the built-in transcription: https://github.com/pipecat-ai/smart-turn
Thanks. Then maybe it’s similar to Moshi: https://github.com/kyutai-labs/moshi?tab=readme-ov-file