Comment by beering
2 years ago
Even your skeptical take doesn't fully show how faked this was.
> The video has a disclaimer that it was edited for latency.
There was no disclaimer that the prompts were different from what's shown.
> And good speech-to-text and text-to-speech already exists, so building that part is trivial. There's no deception.
Look at how many people thought it could react to voice in real time. The net result is that a lot of people (maybe most?) were deceived. And the text prompts were actually longer and more specific than what was said in the video!
> somebody is pressing a button to submit stills from a video feed, rather than live video.
Somebody hand-picked images to convey exactly the right amount of information to Gemini.
> Does that mean the model takes short video inputs as well? I'm assuming so
It was given a hand-picked series of still images, with the hands still on the cups, so that it was easier to understand which cup moved where.
Source for the above: https://developers.googleblog.com/2023/12/how-its-made-gemin...