Comment by crazygringo

2 years ago

Does it matter at all with regards to its AI capabilities though?

The video has a disclaimer that it was edited for latency.

And good speech-to-text and text-to-speech already exist, so building that part is trivial. There's no deception.

So then it seems like somebody is pressing a button to submit stills from a video feed, rather than live video. It's still just as useful.

My main question then is about the cup game, because that absolutely requires video. Does that mean the model takes short video inputs as well? I'm assuming so, and that it generates audio outputs for the music sections as well. If those things are not real, then I think there's a problem here. The Bloomberg article doesn't mention those, though.

Even your skeptical take doesn't fully show how faked this was.

> The video has a disclaimer that it was edited for latency.

There was no disclaimer that the prompts were different from what's shown.

> And good speech-to-text and text-to-speech already exist, so building that part is trivial. There's no deception.

Look at how many people thought it could react to voice in real time - the net result is that a lot of people (maybe most?) were deceived. And the text prompts were actually longer and more specific than what was said in the video!

> somebody is pressing a button to submit stills from a video feed, rather than live video.

Somebody hand-picked images to convey exactly the right amount of information to Gemini.

> Does that mean the model takes short video inputs as well? I'm assuming so

It was given a hand-picked series of still images with the hands still on the cups so that it was easier to understand what cup moved where.

Source for the above: https://developers.googleblog.com/2023/12/how-its-made-gemin...

I'm ok with "edited for latency" or "only showing the golden path".

But the most impressive part of the demo was the way the LLM just seemed to know when to jump in with a response. It appeared to wait until the user had finished the drawing, or even jump in slightly before the drawing was finished. At one point the LLM was halfway through a response, saw the user was now colouring the duck in blue, and started talking about how the duck appeared to be blue.

The LLM also appeared to know when a response wasn't needed because the user was just agreeing with the LLM.

I'm not sure how many people noticed that on a conscious level, but I'm positive everyone noticed it subconsciously and felt the interaction was much more natural.

As you said, good speech-to-text and text-to-speech have already been done, along with multi-modal image/video/audio LLMs and image/music generation. The only novel thing Google appeared to be demonstrating, and what was most impressive, was this apparently natural interaction. But that part was all fake.

Audio input without text in the middle, and video input, are two things they made a big deal out of. Then they called it a hands-on demo, and it was faked.

> My main question then is about the cup game, because that absolutely requires video.

They did it with carefully timed images, and provided a few examples first.
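
For what it's worth, the blog post describes that as interleaved still frames plus text with a worked example up front, not live video. Here's a rough sketch of that pattern using the google.generativeai Python SDK - the frame files, wording, and answers below are my own placeholders, not the actual prompts from the demo:

```python
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-pro-vision")

# Hand-picked stills, captured while the hands are still on the cups so each
# swap is unambiguous -- interleaved images + text, not a video stream.
example_frames = [PIL.Image.open(f"example_swap_{i}.png") for i in range(3)]
query_frames = [PIL.Image.open(f"query_swap_{i}.png") for i in range(3)]

prompt = [
    "The ball starts under the leftmost cup. Track it through each swap.",
    *example_frames,
    "Answer: the ball is under the middle cup.",  # worked example (few-shot)
    "Now the same game again. Track the ball through these swaps:",
    *query_frames,
    "Which cup is the ball under?",
]

response = model.generate_content(prompt)
print(response.text)
```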

> I'm assuming so, and that it generates audio outputs for the music sections as well

No, it was given the ability to search for music, so it was just generating search terms.
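
So the model's output for the music part is just text. A minimal sketch of that pattern - the search_music helper and the prompt wording are hypothetical stand-ins, not whatever Google actually wired up:

```python
import urllib.parse
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-pro")

def search_music(query: str) -> str:
    """Hypothetical stand-in for a music search backend."""
    return "https://example.com/music/search?q=" + urllib.parse.quote(query)

# The model only emits search keywords; the audio itself comes from search.
response = model.generate_content(
    "Suggest search keywords for upbeat instrumental music to go with "
    "a drawing of a blue duck. Reply with the keywords only."
)
print(search_music(response.text.strip()))
```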

Here are more details:

https://developers.googleblog.com/2023/12/how-its-made-gemin...