Comment by beering

2 years ago

Even your skeptical take doesn't fully show how faked this was.

> The video has a disclaimer that it was edited for latency.

There was no disclaimer that the prompts were different from what's shown.

> And good speech-to-text and text-to-speech already exists, so building that part is trivial. There's no deception.

Look at how many people thought it could react to voice in real time. The net result is that a lot of people (maybe most?) were deceived. And the text prompts were actually longer and more specific than what was said in the video!

> somebody is pressing a button to submit stills from a video feed, rather than live video.

Somebody hand-picked images to convey exactly the right amount of information to Gemini.

> Does that mean the model takes short video inputs as well? I'm assuming so

It was given a hand-picked series of still images, with the hands still on the cups, so that it was easier to understand which cup moved where.

Source for the above: https://developers.googleblog.com/2023/12/how-its-made-gemin...