Comment by jjk7

5 days ago

Probably accessibility APIs

Which specific ones, though, let you send input to a window without raising it? People have been trying to do "focus follows mouse [without auto raise]" on the Mac for a long time, and the synthetic-event equivalent of Command+click is the only method I'm aware of that has been discovered, e.g. as used in https://github.com/sbmpost/AutoRaise

There is also this old blog post by Yegge [1], which mentions `AXUIElementPostKeyboardEvent`, but there were plenty of bugs with that, and I haven't seen anyone else build on it. I guess the modern equivalent is `CGEventPostToPSN`/`CGEventPostToPid`. It's a good candidate, though; perhaps the Sky team they acquired knows the right private APIs to get this working.

Edit: The thread at [2] also has some interesting tidbits, such as Automator.app having "Watch Me Do", which can also do this, and a CLI tool that claims to use the `CGEventPostToPid` API [3]. Maybe there are more ways to do it than I realized.

[1] https://steve-yegge.blogspot.com/2008/04/settling-osx-focus-...

[2] https://www.macscripter.net/t/keystroke-to-background-app-as...

[3] https://github.com/socsieng/sendkeys

  • You don't actually need to send CGEvents to UI elements to make them do things ;)

    • Could you elaborate on what you mean? My understanding of the Cocoa event loop was that ultimately everything is received as an NSEvent at the application layer (maybe that's wrong though).

      Do you mean that you can just call `AXUIElementPerformAction` once you have a reference to the element, and the OS will internally synthesize the right type of event even if the app is not in the foreground?

  • Maybe they used Claude to come up with a good method to do this. /s

    But I was also wondering how this even works. The AI agent can have its own cursors, and none of its actions interrupt my own workflow at all? Maybe I need to try this.

    Also, this sounds like it would be very expensive, since from my understanding each app frame needs to be analysed as an image first, which is pretty token-intensive.