Comment by teleforce

2 years ago

Have you seen the demo video? It is really impressive, and AFAIK OpenAI does not have a similar product offering at the moment, demoed or released.

Google essentially claimed a novel approach, a natively multi-modal LLM, unlike OpenAI's non-native approach, and according to them this has the potential to further improve the LLM state of the art.

They have also backed up their claims in a paper for the world to see, and the results for the Ultra version of Gemini are encouraging, losing only on the sentence-completion dataset to GPT-4. Remember that the new natively multi-modal Gemini has only just reached version 1.0. Imagine where it will be at version 4, which is where ChatGPT is now. Competition is always good, whether it is desperate or not, because in the end the users win.

If they put the same team on that Gemini video as they do on Pixel promos, you're better off assuming half of it is fake and the other half exaggerated.

Don't buy into marketing. If it's not in your own hands to judge for yourself, then it might as well be literally science fiction.

I do agree with you that competition is good and when massive companies compete it's us who win!

  • The hype video should be taken with a grain of salt, but the level of capability displayed in it seems probable in the not-too-distant future, even if Gemini can't currently deliver it. All the technical pieces are there for this to become a reality eventually. Exciting times ahead.

I would like more details on Gemini's 'native' multimodal approach before assuming it is something truly unique. Even if GPT-4V aligns a pretrained image model and a pretrained language model with a projection layer, as in PaLM-E/LLaVA/MiniGPT-4 (unconfirmed speculation, but likely), one could still argue that the composite system of projection-aligned models is being trained 'natively'.

There is nothing in any of Google's claims that precludes the architecture being the same kind of composite system, perhaps with some additional blending of multimodal training earlier in the process than has been published so far. And perhaps, unlike GPT-4V, they aligned a pretrained audio model to eliminate the need for a separate speech-recognition layer, possibly also solving multi-speaker recognition by voice characteristics, but they didn't even demo that... Even this would not be groundbreaking, though: ImageBind from Meta demonstrated the capacity to align an audio model with an LLM in the same way image models have been aligned with LLMs. I would even argue that Google skipping the natural-language intermediate step between LLM output and image generation actually supports the position that they may be using projection layers to create interfaces between these modalities. However, that direct image-generation projection was also a capability published by Meta with ImageBind.
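For concreteness, the projection-layer pattern from LLaVA/MiniGPT-4 that I keep referring to amounts to roughly the sketch below. This is a toy PyTorch illustration of mine, not anything Google or OpenAI has published; the class name, the dimensions, and the assumption of a HuggingFace-style model that accepts inputs_embeds are all mine.

    import torch
    import torch.nn as nn

    class ProjectionAlignedVLM(nn.Module):
        # Toy composite model: a frozen vision encoder bridged to a
        # (frozen or fine-tuned) decoder-only LLM by one trainable layer.
        def __init__(self, image_encoder, llm, img_dim=1024, llm_dim=4096):
            super().__init__()
            self.image_encoder = image_encoder   # e.g. a CLIP/ViT backbone
            self.llm = llm                       # HuggingFace-style causal LM
            self.projection = nn.Linear(img_dim, llm_dim)  # the alignment layer

        def forward(self, pixel_values, input_ids):
            # Assume the encoder returns (batch, n_patches, img_dim);
            # project the patch features into the LLM's embedding space.
            image_tokens = self.projection(self.image_encoder(pixel_values))
            # Treat the projected patches as if they were ordinary token
            # embeddings and prepend them to the text prompt.
            text_embeds = self.llm.get_input_embeddings()(input_ids)
            inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
            return self.llm(inputs_embeds=inputs_embeds)

Training something of roughly this shape end to end, with audio handled the same way, would be entirely consistent with how Google has worded its claims.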

What seems more likely, and not entirely unimpressive, is that they refined those existing techniques for building composite multimodal systems and created something they plan to launch soon. Crucially, though, they still have not actually launched it. That puts them in a similar position to when GPT-4 was first announced with vision capabilities but then did not offer them as a service for quite an extended time. Google has yet to ship, and as a result it fails to back up any of its interesting claims with evidence.

Most of Google's demos here are possible today with a clever interface layer over GPT-4V + Whisper. And while the demos 'feel' more natural, no claim is being made that they are real-time, so we don't know how much practical improvement in interface and user experience their product would actually offer compared to what clever combinations of GPT-4V + Whisper can already achieve.
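As a very rough illustration of what such an interface layer might look like, here is a sketch of mine (not anything shipped or endorsed by either company): transcribe the user's speech with Whisper, then hand the transcript plus a camera frame to GPT-4V. The model identifiers whisper-1 and gpt-4-vision-preview reflect OpenAI's current API and may change, and the glue code is made up.

    import base64
    from openai import OpenAI  # assumes the openai>=1.0 Python SDK

    client = OpenAI()

    def ask_about_frame(audio_path: str, image_path: str) -> str:
        # 1. Speech -> text with Whisper.
        with open(audio_path, "rb") as f:
            transcript = client.audio.transcriptions.create(
                model="whisper-1", file=f
            ).text

        # 2. Transcript + camera frame -> GPT-4V.
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()

        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": transcript},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

The round trips through separate services are a large part of why a stack like this doesn't feel real-time, which is exactly why the absence of any real-time claim in the Gemini video matters.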

If what they're competing with is other unreleased products, then they'll have to compete with OpenAI's thing that made all its researchers crap their pants.

What makes it native?

  • Good question.

    Perhaps for audio and video it means directly integrating the spoken sound (audio mode -> LLM) rather than translating the sound to text and feeding the text to the LLM (audio mode -> text mode -> LLM).

    But to be honest I'm guessing here; perhaps LLM experts (or an LLM itself, since they claimed capability comparable to human experts) can verify whether this is truly what they meant by a native multi-modal LLM. A rough sketch of the distinction I have in mind is below.
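    The sketch is Python of my own, not anything from the Gemini paper; it assumes a HuggingFace-style LLM that accepts inputs_embeds, and every component name (asr, tokenizer, audio_encoder, projection) is a placeholder.

        import torch

        def answer_via_transcript(asr, tokenizer, llm, waveform, prompt_ids):
            # audio mode -> text mode -> LLM: the LLM only ever sees text
            # tokens; all acoustic information is discarded by the ASR step.
            text_ids = tokenizer(asr(waveform))
            return llm(input_ids=torch.cat([text_ids, prompt_ids], dim=1))

        def answer_natively(audio_encoder, projection, llm, waveform, prompt_ids):
            # audio mode -> LLM: audio frames are projected into the LLM's
            # embedding space and attended to directly, with no transcript.
            audio_embeds = projection(audio_encoder(waveform))
            text_embeds = llm.get_input_embeddings()(prompt_ids)
            return llm(inputs_embeds=torch.cat([audio_embeds, text_embeds], dim=1))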

    • It's highly unlikely for a generative model to be able to reason about language at this level based on audio features alone. Gemini may use audio cues, but text tokens must be fed into the very early layers of the transformer for complex reasoning to be possible. And because the Gemini paper only mentions a transformer architecture, I don't see a way for them to implement speech-to-text inside such an architecture (while also allowing direct text input). Maybe 'native' here just means that such a stack of models was trained together.
