It can be real-time while still having more latency than depicted in the video (and the video itself clearly stated that Gemini does not respond that quickly).
A local model could send relevant still images from the camera feed to Gemini, along with the text transcript of the user’s speech. Then Gemini’s output could be read aloud with text-to-speech. Seems doable within the present cost and performance constraints.
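The pipeline described above can be sketched roughly as follows. This is a hypothetical illustration only: the helper functions (`transcribe`, `pick_keyframe`, `ask_gemini`, `speak`) are placeholders for a local speech-to-text model, a local frame-selection model, a Gemini API call, and a text-to-speech engine, not real APIs.

```python
# Hypothetical sketch of the described pipeline. Each helper is a
# placeholder for a real component, not an actual library call.

def transcribe(audio: bytes) -> str:
    # Placeholder: a local speech-to-text model would go here.
    return "What am I looking at?"

def pick_keyframe(frames: list[str]) -> str:
    # Placeholder: a local model picks the most relevant still image
    # from the camera feed (here we just take the latest frame).
    return frames[-1]

def ask_gemini(image: str, transcript: str) -> str:
    # Placeholder: send the still image plus the text transcript to
    # Gemini and return its text response.
    return f"Response to '{transcript}' given frame {image}"

def speak(text: str) -> str:
    # Placeholder: read the response aloud with text-to-speech.
    return text

def respond(frames: list[str], audio: bytes) -> str:
    transcript = transcribe(audio)
    frame = pick_keyframe(frames)
    return speak(ask_gemini(frame, transcript))
```

Each stage adds latency (local inference, network round trip, TTS synthesis), which is why the end-to-end experience would be slower than the edited video suggests while still being usable in real time.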