Comment by why_at
16 hours ago
My first impression coming away from this is skepticism.
Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.
Most of their examples seem like they could have been done with a right click drop down menu so they don't really need to "re-invent the mouse pointer".
So is this thing talking to Google's servers all the time for the AI integration? So it won't work if you're not connected to the internet? Privacy concerns are obvious; now Google wants to have an AI watching literally everything you do on your computer?
Does it cost the user anything for the LLM use? If it's free will it stay free forever? That's quite a lot to give away if they're expecting people to use it to change a single word like in one of their examples. I guess they're expecting to make the money back by gathering data about literally everything you do on your computer.
There might be a killer app for AI integration with personal computers that has yet to be invented, but this doesn't look like it.
The killer app was conceived as early as the 1980s: an agent running on your computer, organizing your files, your schedule, your messages, your bills, your bank accounts, etc. All the routine drudgery of your life could be offloaded to a smart agent that, based on your preferences, brings you the information you need via natural language queries, contextualized to what you're doing at the time, when you need it.
What's being delivered now is an agent running on someone else's computer, copying your data to someone else's database, with zero responsibility or mandate to protect that data and not share it with anyone else (in fact, they almost always promise to share it with their thousand partners), offering suggestions and preferences based on someone else's so-called recommendations, influenced by whoever pays the agent's operators, and increasing pressure to make using someone else's computers + agents the only way to interact with other people and systems.
There is no doubt that LLMs can do amazing things, but the current environment seems to make it nearly impossible to do anything with them that doesn't let someone else inspect, influence, and even restrict everything you are doing with these systems.
A few decades back, a lot of computer use was emails. And it was stored on someone else's servers - with everyone from server operators along the route, to the government potentially having access to it. Even HTTPS is a relatively recent thing.
I guess what I'm saying is - we've always had this problem.
Yea there have always been gaps in privacy, but nowadays it's several orders of magnitude easier for corporations to exploit that private data at scale.
The second half of your comment is a go-to-market concern but doesn't feel so relevant for a research prototype. It could be done with a private local model too, maybe not by Google.
But I don't think the voice problem is surmountable. I closed their image editing demo when I saw it required a mic.
It would be appealing as a Spotlight-like text pop-up interface where you type instructions, which would work in social/office environments, but that might only appeal to power users.
This will sound like another brick in the paved road to dystopia but I'm kinda bullish on equipment that can recognize subvocalization. Or at least let me have a small drawing tablet with a stylus (think etch-a-sketch or Wacom Intuos) because at this point I'd rather practice writing and do away with typing altogether (even though I enjoy typing for typing's sake via MonkeyType).
I've been dreaming about that for 20 years. And then use it for people to communicate while sleeping.
Yeah I think there could be something to the integration of AI in an operating system so that it can handle things going on in different applications the same way you can already copy and paste between things.
But if it's going to require phoning home to some Google/OpenAI/whoever then forget it. I don't want a constant connection to my OS from one of these companies.
It seems that if we ultimately want to "move at the speed of thought," it will require speech.
> It seems that if we ultimately want to "move at the speed of thought," it will require speech.
Except for the large majority of people who read, type, and click way faster than they can talk. Especially for visual things it’s way faster to drag a rectangle than to describe what you want.
A lot of us also aren’t linear verbal thinkers. It would take minutes to hours to verbalize concepts we can grasp visually/schematically in seconds.
Great book on the topic: https://www.goodreads.com/book/show/60149558-visual-thinking
There's the adage that writing is thinking, but even more accurately at least for me, editing is thinking.
Neither typing speed nor dictation speed is a true bottleneck, but editing speech seems like it'd be harder than editing text.
Though there may be some hybrid approach that can work well.
Yes, it does seem kinda ... pointless.
You should look into how often people are using tools like WisprFlow and SuperWhisper. Voice is a very natural mechanism. Most people working in open floor plans are wearing headphones anyway. As long as you're not screaming, it's probably fine. Maybe we'll move away from open-plan offices in the bid for efficiency, which I would welcome.
I am going fully remote because dictation is such a better input mechanism for most of my AI interactions that I have become less efficient at my open-floor-plan desk at the office, where I can't dictate and the latency adds up. Typing is just achingly slow these days.
I feel like I can type faster than I can talk but I could be totally wrong?
You should look into how often people are using rectangles with buttons on them. They may be a bit archaic, but they are my preferred input method. For example, thanks to rectangles with buttons, the other people in my vicinity do not need to hear about the inane internet arguments I routinely involve myself in.
I dunno how I can express this best, but I found out a very long time ago that my problem with voice input wasn't that it wasn't good enough. My problem with voice input is that I don't want it. I am very happy for people who use these tools that they exist. I will not be them. Yes I am sure.
And yes, I know SuperWhisper can run offline, but it is a notable benefit that versus many modern speech recognition tools my keyboard does not require an always-active Internet connection, a subscription payment, or several teraflops of compute power.
I am not a flat-out luddite. I do use LLMs in some capacity, for whatever it is worth. Ethical issues or not, they are useful and probably here to stay. But my God, there are so many ways in which I am very happy to be "left behind".
I'm sorry, but if you think the share of workers using voice controls in the office is more than 1%, you are in a massive bubble, my dude.
Sorry to Bother You.
https://www.youtube.com/watch?v=XthLQZWIshQ
https://en.wikipedia.org/wiki/Sorry_to_Bother_You
Yeah, I'd hate to use this in an open-plan office (which is like 99% of offices these days) and even using it alone at home would feel awkward. I don't really want to talk to the computer despite what 1950's sci-fi books led us to believe.
It's a cool idea for the future when we have reliable EEG headsets or Neuralink or whatever though.
The only place I'd ever talk to a machine is my car. Instead of huge flashy screens that distract and kill thousands of people, maybe they could build a buttons + voice agent system that could actually be useful and durable. I hate having to tap Waze/Maps/etc. every time I go somewhere, and that I cannot comfortably switch to specific songs en route without risking my life...
I connect my iPhone to my car and it requires Siri to be enabled which I can then use to change songs, Google Maps destinations etc. without having to touch anything.
The Siri voice transcription is pretty awful compared to what I've experienced with ChatGPT, though, and it's weird going back to an almost pre-LLM world where you have to give such stilted, computer-coded voice commands.
It's possible to rely on mouth movements instead of sound. I've been tweaking visual speech recognition (VSR) models for the past few weeks so that I can "talk" to my agents at the office without pissing everyone off. It works okay. Limiting the language to "move this" / "clear that" alongside context cues vastly simplifies the problem and makes it far more feasible on device.
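As a rough illustration of what "limiting the language" means: the command layer can be little more than a lookup over a fixed phrase list plus whatever the pointer is over. Everything below (phrase list, handler names, context keys) is hypothetical, just a sketch of the idea:

    COMMANDS = {
        "move this": "move_selection",
        "clear that": "clear_selection",
        "undo": "undo_last",
    }

    def dispatch(vsr_phrase, context):
        # Map a recognized phrase to an action using pointer context;
        # anything outside the fixed grammar is simply ignored.
        action = COMMANDS.get(vsr_phrase.strip().lower())
        if action is None:
            return None
        target = context.get("hovered_object")  # whatever the cursor is over
        return (action, target)

    # dispatch("move this", {"hovered_object": "layer_3"}) -> ("move_selection", "layer_3")

A grammar that small also means the VSR model only has to discriminate a handful of mouth shapes, which is a big part of why it works at all.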
I think it's brilliant UX.
No UX needs to be perfect for everyone, but this doesn't sound trivial to make reliable.
First things that came to mind:
And that's just thinking about it for 30 seconds. I'm sure there are some really good use cases, but will any research group/company push through for years and years to make it really good even if the response is lukewarm?
>non english languages (god forbid bilingualism)
In my experience, any combination of computers + speech + Danish has so far, without exception, been terrible. The last time I tested ChatGPT, it couldn't understand me at all. I spoke both in my local dialect and as close to Rigsdansk [π] as I could manage. Unusable performance, and in any case I should be able to talk normally, or there's no point. That was about a year ago - it may have improved, but I doubt it. I'm completely done trying to talk to machines.
Pre-emptive kamelåså: https://www.youtube.com/watch?v=s-mOy8VUEBk
[π] https://en.wikipedia.org/wiki/Danish_language#Dialects
>Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.
Reads like the argument against cell phones when you don't have a cabinet around you...
I wouldn't sit in the office talking on my phone next to my colleagues, that would be really annoying.
I'd go and find a small meeting room or conference call booth in the office and take it there.
Essentially, a cabinet.
The argument is against human to machine control. Not human to human communication.
In fact, when humans happen to order other humans, it's typically done in writing.
A General-Purpose Bubble Cursor
https://www.youtube.com/watch?v=46EopD_2K_4
>We present a general-purpose implementation of Grossman and Balakrishnan's Bubble Cursor, the fastest general pointing facilitation technique in the literature. Our implementation functions with any application on the Windows 7 desktop. Our implementation functions across this infinite range of applications by analyzing pixels and by leveraging human corrections when it fails.
Transcript:
>We present the general-purpose implementation of the bubble cursor. The bubble cursor is an area cursor that expands to ensure that the nearest target is always selected. Our implementation functions on the Windows 7 desktop and any application for that platform. The bubble cursor was invented in 2005 by Grossman and Balakrishnan. However, a general-purpose implementation of this cursor, one that works with any application on a desktop, has not been deployed or evaluated. In fact, the bubble cursor is representative of a large body of target-aware techniques that remain difficult to deploy in practice. This is because techniques like the bubble cursor require knowledge of the locations and sizes of targets in an interface. [...]
https://www.dgp.toronto.edu/~ravin/papers/chi2005_bubblecurs...
>The Bubble Cursor: Enhancing Target Acquisition by Dynamic Resizing of the Cursor’s Activation Area
>Tovi Grossman, Ravin Balakrishnan; Department of Computer Science; University of Toronto
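For what it's worth, the selection rule itself is tiny. Here's a rough sketch of my reading of the paper, assuming circular targets given as (x, y, radius) tuples (the names are mine, not from the paper's code); the hard part, as the video says, is knowing where the targets are in an arbitrary app at all, which is what the pixel-analysis work addresses:

    import math

    def bubble_cursor(cursor, targets):
        # targets: list of (x, y, r) circles; cursor: (x, y).
        # The bubble grows to fully contain the closest target when it can,
        # but is capped at the distance where it would start to touch the
        # second-closest target, so exactly one target is ever captured.
        cx, cy = cursor

        def intersect_dist(t):   # distance to a target's nearest edge
            return math.hypot(t[0] - cx, t[1] - cy) - t[2]

        def contain_dist(t):     # distance to a target's farthest edge
            return math.hypot(t[0] - cx, t[1] - cy) + t[2]

        ordered = sorted(targets, key=intersect_dist)
        captured = ordered[0]
        radius = contain_dist(captured)
        if len(ordered) > 1:
            radius = min(radius, intersect_dist(ordered[1]))
        return captured, radius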
I've written more about Morgan Dixon's work on Prefab (pre-LLM pattern recognition, which is much more relevant with LLMs now).
https://news.ycombinator.com/item?id=29105919
The "Edit an Image" Demo at the bottom is pretty fun. Maybe this is just Google flexing their LLM inference capacity.
That demo was an absolute disaster for me on Firefox on Mac. It just fundamentally didn't work - the voice was way behind my pointer, there were multiple agents speaking over each other saying conflicting things, and it couldn't even move the crab to the bottom right of the image. Embarrassingly bad, I would say!
Right — it does seem cool but the voice is patching over a major gap. If I'm talking already, why wouldn't I just describe what I'm looking at and have the AI grab it for me?
I think they answer that question pretty convincingly: because if what you're looking at is already on the screen, it's much easier to point to it and say "that" than to describe it.
(And if it's an abstract entity like a file, it might not even be possible to describe it, short of rattling off the entire file path)
Pull up any moderately busy picture with more than a trivial number of objects. Pictures of "traffic" or with other similar repetition are great for this demo. Pick one specific object (like a specific tire on one car) in the image and write (or say) out all the words you'd need to specify that exact object. Now take the same image and point at the object with your mouse or circle it with an annotation tool. It's often very, very hard to describe accurately which object you are talking about; you will often resort to vague "location" words anyway, like "on the upper left", that try to define the position in a coarse way that requires careful parsing to understand. Pointing/annotating is massively superior in brevity, clarity, and speed.
Nothing new under the sun. "Put that there" demo, 1982.
https://www.media.mit.edu/publications/put-that-there-voice-...
https://www.youtube.com/watch?v=RyBEUyEtxQo
Yup - what Google is suggesting here will never materialize beyond being a slop feature. People who want these bespoke workflows will build them or seek out specific tools that enable them, not trusting some overarching daemon that contextually watches their cursor. I don't trust Google one bit to execute correctly on something like this.
Well, you see, to really, really sell it to the common folks, they need to convince you that chatbots are the "Intelligence". So they are coming up with all sorts of crap, like this one. The TV advertisements for Gemini and co. are indicative of how they see the average user: as an idiot of sorts, who needs the shit-device for pretty much anything. Oh, you spilled some water on the counter top? Quick, ask Gemini what to do! You are a 20-something individual home alone? Quick, lie on the couch and ask Gemini if you can really talk to it, omg, it's so exciting! You were on holiday all alone, but in the middle of a really large crowd? Gemini to the rescue, cut those people out and make it look like it was an exclusive spot, just for you! Nobody else was there. So this proposal is going in the same direction - probably targeting the average office "idiot".