Comment by TaylorAlexander
2 years ago
Ah, that’s right. I guess my question is: is it a true multimodal model (able to produce arbitrary audio), or is it a speech-to-text system (OpenAI has a model called Whisper for this) feeding text to the model and then using text-to-speech to read the output aloud?
Though now that I am reading the Gemini technical report, it says the model can only receive audio as input; it can’t produce audio as output.
Still, based on a quick glance at their technical report, it seems Gemini might have superior audio input capabilities. I am not sure of this, though, now that I think about it.
One of the demo videos explicitly addresses this point: https://youtu.be/D64QD7Swr3s?si=_bBa9aPmqGbo-Iej
Oh, that’s actually pretty good then. It also seems it does output audio, despite the PDF from Google I was reading saying otherwise. Hmm.