
Comment by foobar_______

2 years ago

Agreed. The whole thing reeks of desperation. Half the video is jerking themselves off that they've done AI longer than anyone, and they "release" (not actually available in most countries) a model that is only marginally better than the current GPT-4 on cherry-picked metrics, after nearly a year of lead time?!?!

That's your response? Ouch.

Have you seen the demo video? It is really impressive, and AFAIK OpenAI does not have a similar product offering at the moment, demoed or released.

Google essentially claimed a novel approach, a natively multi-modal LLM, unlike OpenAI's non-native approach, and according to them this has the potential to further improve the LLM state of the art.

They have also backed up their claims in a paper for the world to see, and the results for the Ultra version of Gemini are encouraging, only losing to GPT-4 on the sentence-completion dataset. Remember that the new natively multi-modal Gemini has just started and has only reached version 1.0. Imagine it at version 4, as ChatGPT is now. Competition is always good, whether it is desperate or not, because in the end the users win.

  • If they put the same team on that Gemini video as they do on Pixel promos, you're better off assuming half of it is fake and the other half exaggerated.

    Don't buy into marketing. If it's not in your own hands to judge for yourself, then it might as well be literally science fiction.

    I do agree with you that competition is good and when massive companies compete it's us who win!

    • The hype video should be taken with a grain of salt but the level of capability displayed in the video seems probable in the not too distant future even if Gemini can't currently deliver it. All the technical pieces are there for this to be a reality eventually. Exciting times ahead.

      1 reply →

  • I would like more details on Gemini's 'native' multimodal approach before assuming it is something truly unique. Even if GPT-4V were aligning a pretrained image model and pretrained language model with a projection layer like PaLM-E/LLaVA/MiniGPT-4 (unconfirmed speculation, but likely), it's not as if they are not 'natively' training the composite system of projection-aligned models.

    There is nothing in any of Google's claims that precludes the architecture being the same kind of composite system, maybe with some additional blending in of multimodal training earlier in the process than has been published so far. And perhaps, also unlike GPT-4V, they might have aligned a pretrained audio model to eliminate the need for a separate speech recognition layer, and possibly solve for multi-speaker recognition by voice characteristics, but they didn't even demo that... Even this would not be groundbreaking though: ImageBind from Meta demonstrated the capacity to align an audio model with an LLM in the same way image models have been aligned with LLMs.

    I would perhaps even argue that Google skipping the natural-language intermediate step between LLM output and image generation actually supports the position that they may be using projection layers to create interfaces between these modalities. However, this direct image-generation projection example was also a capability published by Meta with ImageBind.

    What seems more likely, and not entirely unimpressive, is that they refined those existing techniques for building composite multimodal systems and created something that they plan to launch soon. Crucially, though, they still have not actually launched it here, which puts them in a similar position to when GPT-4 was first announced with vision capabilities but then did not offer them as a service for quite an extended time. Google has yet to ship it, and as a result fails to back up any of its interesting claims with evidence.

    Most of Google's demos here are possible with a clever interface layer to GPT-4V + Whisper today. And while the demos 'feel' more natural, there is no claim being made that they are real-time demos, so we don't know how much practical improvement in the interface and user experience would actually be possible in their product when compared to what is possible with clever combinations of GPT-4V + Whisper today.
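
    To make the projection-layer idea concrete, here is a minimal sketch of the kind of LLaVA/PaLM-E-style composite system I'm describing. This is illustrative only: the module names, dimensions, and the HuggingFace-style inputs_embeds interface are my assumptions, not anything Google or OpenAI has confirmed about their architectures.

      import torch
      import torch.nn as nn

      class ProjectionAlignedVLM(nn.Module):
          def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
              super().__init__()
              self.vision_encoder = vision_encoder  # pretrained, often frozen
              self.llm = llm                        # pretrained decoder-only LLM
              # The "glue": a small projection from image-feature space into
              # the LLM's token-embedding space (LLaVA uses a linear/MLP layer).
              self.projector = nn.Linear(vision_dim, llm_dim)

          def forward(self, pixel_values, input_ids):
              image_feats = self.vision_encoder(pixel_values)    # (B, patches, vision_dim)
              visual_tokens = self.projector(image_feats)        # (B, patches, llm_dim)
              text_embeds = self.llm.get_input_embeddings()(input_ids)
              # The LLM attends over the projected image "tokens" and the text
              # tokens as one sequence.
              inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
              return self.llm(inputs_embeds=inputs_embeds)

    A system like this can still be trained end-to-end on mixed modalities, which is why I'd want real architectural detail before accepting that "native" means something fundamentally different.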

  • If what they're competing with is other unreleased products, then they'll have to compete with OpenAI's thing that made all its researchers crap their pants.

  • What makes it native?

    • Good question.

      Perhaps for audio and video it means directly integrating the spoken sound (audio -> LLM) rather than translating the sound to text and feeding the text to the LLM (audio -> text -> LLM).

      But to be honest I'm guessing here; perhaps LLM experts (or the LLM itself, since they claim capability comparable to human experts) can verify whether this is truly what they mean by a native multi-modal LLM.
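
      Purely to illustrate the distinction I mean, here is a rough sketch of the two routes (the function and object names are made up, not from any published Gemini or GPT-4 code):

        # Non-native route: audio -> text -> LLM. Tone, prosody, and
        # background sounds are discarded at the transcription step.
        def pipeline_answer(audio, asr_model, llm):
            transcript = asr_model.transcribe(audio)
            return llm.generate(transcript)

        # "Native" route: audio -> LLM. The audio is tokenized/embedded
        # directly, and the model attends over those audio tokens like any
        # other tokens.
        def native_answer(audio, audio_tokenizer, multimodal_llm):
            audio_tokens = audio_tokenizer.encode(audio)
            return multimodal_llm.generate(audio_tokens)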

      2 replies →

I’m impressed that it’s multimodal and includes audio. GPT-4V doesn’t include audio afaik.

Also I guess I don’t see it as critical that it’s a big leap. It’s more like “That’s a nice model you came up with, you must have worked real hard on it. Oh look, my team can do that too.”

Good for recruiting too. You can work on world class AI at an org that is stable and reliable.

  • https://openai.com/blog/chatgpt-can-now-see-hear-and-speak

    I think it's app only though

    • That's different. It's essentially using the Whisper model for audio-to-text, and that text is the input to ChatGPT.

      Multimodal would be watching a YouTube video without captions and asking "how did a certain character know it was raining outside?" based on the sound of rain, with no image of rain.
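
      Roughly, my understanding is that the current setup looks something like this (speculative, since the app's internals aren't public, but these are real OpenAI API calls):

        from openai import OpenAI

        client = OpenAI()

        def voice_chat(audio_path: str) -> str:
            # Step 1: Whisper turns the audio into text; non-speech sounds
            # like rain are mostly lost here.
            with open(audio_path, "rb") as f:
                transcript = client.audio.transcriptions.create(
                    model="whisper-1", file=f
                )
            # Step 2: the chat model only ever sees the transcribed text.
            reply = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": transcript.text}],
            )
            return reply.choices[0].message.content

      Nothing about the rain survives step 1 unless Whisper happens to transcribe it, which is why I wouldn't call that setup natively multimodal.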

      2 replies →

    • Ah, that's right. I guess my question is: is it a true multimodal model (able to produce arbitrary audio), or is it a speech-to-text system (OpenAI has a model called Whisper for this) feeding text to the model and then using text-to-speech to read the reply aloud?

      Though now that I am reading the Gemini technical report, it can only receive audio as input; it can't produce audio as output.

      Still, based on quickly glancing at their technical report, it seems Gemini might have superior audio input capabilities. I am not sure of this, though, now that I think about it.

      2 replies →

  • Google is stable and reliable?

    • They can certainly pretend they are for hiring purposes. Compared to a company that fired their CEO, nearly had the whole company walk out, then saw the board ousted and the CEO restored, Google does look more reliable.

      Just don’t speak to xooglers about it. ;)

      2 replies →

I worked at Google up through 8 weeks ago and knew there _had_ to be a trick --

You know those stats they're quoting for beating GPT-4 and humans? (both are barely beaten)

They're doing K = 32 chain of thought. That means running an _entire self-talk conversation 32 times_.

Source: https://storage.googleapis.com/deepmind-media/gemini/gemini_..., section 5.1.1 paragraph 2
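
In code terms, that amounts to something like the sketch below. This is my own reading of the report, not Google's code, and the exact rule they use to combine the 32 chains may differ from the plain majority vote shown here; model.generate and extract_final_answer are hypothetical stand-ins.

    from collections import Counter

    def cot_at_k(model, question, k=32):
        # Sample k independent chain-of-thought generations for one question.
        answers = []
        for _ in range(k):
            chain = model.generate(
                f"{question}\nLet's think step by step.",
                temperature=0.7,  # sampling, so the k chains differ
            )
            answers.append(extract_final_answer(chain))
        # Combine the k chains, here by majority vote over the final answers.
        return Counter(answers).most_common(1)[0][0]

Each of those 32 generations is a full pass through the model, which is why it matters whether the GPT-4 numbers it's compared against were produced with the same budget.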

  • How do you know GPT-4 is 1-shot? The details about it aren't released; it is entirely possible it does stuff in multiple stages. Why wouldn't OpenAI use their most powerful version to get better stats, especially when they don't say how they got it?

    Google being more open here about what they do is in their favor.

    • There's a rumour that GPT-4 runs every query either 8x or 16x in parallel, and then picks the "best" answer using an additional AI that is trained for that purpose.
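
      If true, that would be a best-of-n setup along these lines (pure speculation; nothing about GPT-4's serving stack is public, and the names here are hypothetical):

        def best_of_n(generator, scorer, prompt, n=8):
            # Generate n independent candidate answers for the same prompt.
            candidates = [generator.generate(prompt, temperature=0.8) for _ in range(n)]
            # A separate model (e.g. a reward/ranking model) scores each
            # candidate; the user only sees the top-scoring one.
            return max(candidates, key=lambda c: scorer.score(prompt, c))

      That's a different trick from majority-vote chain of thought, but it burns extra compute per query in a similar way.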

      8 replies →

    • Same way I know the latest BMW isn't running on a lil nuke reactor. I don't, technically. But there's not enough comment room for me to write out the 1000 things that clearly indicate it. It's a "not even wrong" question on your part.

  • Where are you seeing that 32-shot vs 1-shot comparison drawn? In the PDF you linked, it seems like they run it several ways, using the same technique on both models each time, and just pick the technique where Gemini wins by the most.

This reminds me of their last AI launch. When Bard came out, it wasn't available in EU for weeks (months?). When it finally arrived, it was worse than GPT-3.

Google are masters at jerking themselves off. I mean come on... "Gemini era"? "Improving billions of people’s lives"? Tone it down a bit.

It screams desperation to be seen as ahead of OpenAI.

  • Google has billions of users whose lives are improved by their products. What is far-fetched about this AI improving those product lines?

    Sounds like it's you that needs to calm down a bit. God forbid we get some competition.

    • It's just arrogant to name an era after your own product you haven't even released yet. Let it speak for itself. ChatGPT's release was far more humble and didn't need hyping up to be successful.

    • If by "lives improved" you mean that they have locked people into their products, spied on them, profiled them, and made us into the product they make lots of money from, then yeah, you're totally right.

    • I'm a user, and no way has my life been improved! If anything, they have made me sad and miserable. They offer all these nice-looking things that you really should not use; you want to but can't, you make the mistake anyway, it's nice for a while, and then it's taken away.

      It would be funny if it only happened 10 or 20 times.

      3 replies →