Comment by lolpanda

7 months ago

Great idea! The demo looks impressive. What are your thoughts on real-time translated captioning compared to AI voice? I guess it's still difficult to mimic nonverbal elements like laughter and pauses.

Fantastic question. Our view is that the higher-bandwidth we can make the communication, the more useful it will be. The reason we've moved from IRC to VoIP to video is the efficiency of information transfer, plus the empathic element of face-to-face conversation.

From the technical side, speech-to-speech models have more potential for accuracy (no explicit ASR, no audio->text information loss). We have a few options for mimicking nonverbal elements: we could decide when to naturally mix in the original audio, or train our end-to-end model to handle those nonverbal audio chunks. We'll be trying both, but likely the first option sooner!
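
As a purely illustrative sketch of the first option (not the actual implementation), mixing the original audio back in could look roughly like this. It assumes the nonverbal spans (laughter, pauses) have already been detected by some other component, and that the original and translated tracks are time-aligned and the same length, which real translated speech usually is not. The names `original_weight`, `mix_nonverbal`, `spans`, and `fade` are hypothetical.

```python
def original_weight(i, spans, fade):
    """Weight of the original audio at sample i.

    1.0 inside any span, ramping linearly toward 0.0 over `fade`
    samples on either side, and 0.0 everywhere else.
    Spans are (start, end) sample indices with `end` exclusive.
    """
    best = 0.0
    for start, end in spans:
        if start <= i < end:
            return 1.0
        # Distance in samples from i to the nearest sample inside the span.
        dist = start - i if i < start else i - end + 1
        if fade > 0 and dist <= fade:
            best = max(best, 1.0 - dist / (fade + 1))
    return best


def mix_nonverbal(original, translated, spans, fade=0):
    """Crossfade to the original track inside the given nonverbal spans.

    Assumption: both tracks are equal-length sequences of float samples
    that are already time-aligned.
    """
    if len(original) != len(translated):
        raise ValueError("tracks must be the same length")
    mixed = []
    for i, (o, t) in enumerate(zip(original, translated)):
        w = original_weight(i, spans, fade)
        mixed.append(w * o + (1.0 - w) * t)
    return mixed


if __name__ == "__main__":
    original = [1.0] * 6    # stand-in for the speaker's real audio
    translated = [0.0] * 6  # stand-in for the generated voice
    print(mix_nonverbal(original, translated, spans=[(2, 4)], fade=1))
```

The short linear fade is only there to avoid an audible click at the boundaries; deciding where the spans are is the hard part and is left out entirely here.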