Comment by TaylorAlexander

2 years ago

Ah, that’s right. I guess my question is: is it a true multimodal model (able to produce arbitrary audio), or is it a speech-to-text system (OpenAI has a model called Whisper for this) feeding text to the model and then using text-to-speech to read the output aloud?

Though now that I am reading the Gemini technical report, it looks like Gemini can only receive audio as input; it can’t produce audio as output.

Still, based on a quick glance at their technical report, it seems Gemini might have superior audio input capabilities. I am not sure of this, though, now that I think about it.