Comment by AirMax98
19 hours ago
Right — it does seem cool but the voice is patching over a major gap. If I'm talking already, why wouldn't I just describe what I'm looking at and have the AI grab it for me?
Pull up any moderately busy picture with more than a trivial number of objects; pictures of "traffic" or scenes with similar repetition are great for this demo. Pick one specific object in the image (like a specific tire on one car) and write out (or say) all the words you'd need to specify that exact object. Now take the same image and point at the object with your mouse, or circle it with an annotation tool. It's often very hard to describe accurately which object you're talking about; you'll usually fall back on vague location words like "on the upper left" that define the position in a coarse way and require careful parsing to understand. Pointing/annotating is massively superior in brevity, clarity, and speed.
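To make the brevity point concrete, here is a minimal sketch of how deictic resolution could work: given detected objects as bounding boxes, a single pointer position picks out one object unambiguously, while a verbal description would have to distinguish the target from every similar object in the scene. The names here (`Box`, `resolve_pointer`) are hypothetical illustrations, not any real API.

```python
from dataclasses import dataclass

@dataclass
class Box:
    label: str   # e.g. "tire"
    x: float     # top-left corner
    y: float
    w: float
    h: float

def resolve_pointer(objects: list[Box], px: float, py: float) -> Box | None:
    """Return the smallest detected box containing the pointer position.

    Preferring the smallest hit lets a click on a tire win over the car
    that encloses it."""
    hits = [b for b in objects
            if b.x <= px <= b.x + b.w and b.y <= py <= b.y + b.h]
    return min(hits, key=lambda b: b.w * b.h) if hits else None

# Usage: a busy "traffic" scene with several near-identical objects.
scene = [
    Box("car",  100, 100, 300, 150),
    Box("tire", 120, 210,  40,  40),   # front-left tire of the first car
    Box("tire", 330, 210,  40,  40),
    Box("car",  450, 110, 300, 150),
]
target = resolve_pointer(scene, px=135, py=225)
print(target)  # -> the first tire: one click, no "upper left-ish" ambiguity
```

The contrast is the point: the pointer resolves the reference with two coordinates, whereas a spoken description of "the front-left tire of the leftmost car" still has to be parsed and matched against the scene.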
Nothing new under the sun. "Put that there" demo, 1982.
https://www.media.mit.edu/publications/put-that-there-voice-...
https://www.youtube.com/watch?v=RyBEUyEtxQo
I think they answer that question pretty convincingly: if what you're looking at is already on the screen, it's much easier to point to it and say "that" than to describe it.
(And if it's an abstract entity like a file, it might not even be possible to describe it, short of rattling off the entire file path)