Comment by readyplayeremma
2 years ago
I would like more details on Gemini's 'native' multimodal approach before assuming it is something truly unique. Even if GPT-4V aligns a pretrained image model and a pretrained language model with a projection layer, in the style of PaLM-E/LLaVA/MiniGPT-4 (unconfirmed speculation, but likely), that composite system of projection-aligned models is still being trained 'natively' in a meaningful sense.
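For reference, the projection-layer alignment used by LLaVA/MiniGPT-4 is conceptually simple. Here is a minimal PyTorch-style sketch; the component names and the Hugging Face-like interfaces (get_input_embeddings, inputs_embeds) are my assumptions for illustration, not any lab's actual code:

    import torch
    import torch.nn as nn

    class ProjectionAlignedVLM(nn.Module):
        """Sketch of a LLaVA/MiniGPT-4-style composite model: a frozen vision
        encoder's patch features are projected into the LLM's token-embedding
        space and prepended to the text tokens."""

        def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.vision_encoder = vision_encoder   # e.g. a frozen CLIP ViT (assumed)
            self.llm = llm                         # e.g. a decoder-only LLM (assumed)
            # In the simplest variant, this projector is the only newly trained piece.
            self.projector = nn.Linear(vision_dim, llm_dim)

        def forward(self, pixel_values, input_ids):
            with torch.no_grad():
                patch_feats = self.vision_encoder(pixel_values)   # (B, N_patches, vision_dim)
            visual_tokens = self.projector(patch_feats)           # (B, N_patches, llm_dim)
            text_embeds = self.llm.get_input_embeddings()(input_ids)
            inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
            return self.llm(inputs_embeds=inputs_embeds)

Whether the projector stays a single linear layer or the whole stack is later unfrozen and trained jointly is exactly the "how native is it?" question.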
There is nothing in any of Google's claims that precludes the architecture being the same kind of composite system, perhaps with multimodal training blended in earlier in the process than has been published so far. And perhaps, unlike GPT-4V, they aligned a pretrained audio model to eliminate the need for a separate speech recognition layer, possibly also solving multi-speaker recognition by voice characteristics, but they didn't even demo that... Even this would not be groundbreaking, though: ImageBind from Meta demonstrated the capacity to align an audio model with an LLM in the same way image models have been aligned with LLMs. I would even argue that Google skipping the natural-language intermediate step between LLM output and image generation actually supports the position that they may be using projection layers as interfaces between these modalities. However, this direct image-generation projection was also a capability published by Meta with ImageBind.
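To make the direct-image-generation point concrete, the same projection trick can run in the output direction: instead of emitting a text prompt, hidden states from the LLM are projected into a pretrained image decoder's conditioning space. A rough sketch under assumed dimensions (hypothetical names; not Google's or Meta's actual code):

    import torch
    import torch.nn as nn

    class ImageOutputProjector(nn.Module):
        """Sketch of skipping the text-prompt intermediate: the LLM hidden state
        at a special image-generation token is projected into the conditioning
        space of a frozen image decoder (e.g. a diffusion model's text-embedding
        slot), so no natural-language prompt is ever produced."""

        def __init__(self, llm_dim=4096, cond_dim=768, n_cond_tokens=77):
            super().__init__()
            self.proj = nn.Linear(llm_dim, cond_dim * n_cond_tokens)
            self.n_cond_tokens = n_cond_tokens
            self.cond_dim = cond_dim

        def forward(self, img_token_hidden_state):          # (B, llm_dim)
            cond = self.proj(img_token_hidden_state)         # (B, cond_dim * n_cond_tokens)
            return cond.view(-1, self.n_cond_tokens, self.cond_dim)  # pass to frozen decoder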
What seems more likely, and not entirely unimpressive, is that they refined those existing techniques for building composite multimodal systems and created something they plan to launch soon. Crucially, however, they still have not actually launched it, which puts them in a similar position to when GPT-4 was first announced with vision capabilities but then did not offer them as a service for quite an extended time. Google has yet to ship, and as a result fails to back up any of its interesting claims with evidence.
Most of Google's demos here are possible with a clever interface layer over GPT-4V + Whisper today. And while these demos 'feel' more natural, no claim is made that they are real-time, so we don't know how much practical improvement in interface and user experience their product would actually deliver compared to what clever combinations of GPT-4V + Whisper can already achieve.