Comment by hackgician

2 days ago

accessibility (a11y) trees are super helpful for LLMs; we use them extensively in stagehand! the context is nice for browsers, since you have existing frameworks like selenium/playwright/puppeteer for actually acting on nodes in the a11y tree.
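
for example, a minimal sketch of reading the a11y tree in a browser with playwright's python bindings (example.com is just a stand-in page, and note that page.accessibility.snapshot() is deprecated in newer playwright releases, though it still works):

    from playwright.sync_api import sync_playwright

    def print_a11y_tree(node, depth=0):
        # each node is a dict with at least "role" and "name"
        print("  " * depth + f'{node["role"]}: {node.get("name", "")!r}')
        for child in node.get("children", []):
            print_a11y_tree(child, depth + 1)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")
        snapshot = page.accessibility.snapshot()  # nested dict a11y tree
        if snapshot:
            print_a11y_tree(snapshot)
        browser.close()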

what does that analog look like in more traditional computer use?

There are a variety of accessibility frameworks, from MSAA (old, Windows-only) to IA2, JAB, and UIA (newer). NVDA from NV Access has an abstraction over these APIs that standardizes gathering roles and other information from the matrix of a11y providers, though note its GPL license, depending on how you want to use it.

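To make that concrete, here's a minimal sketch of walking the UIA tree from Python via pywinauto (which wraps UIA); the traversal depth and printed fields are just illustrative:

    from pywinauto import Desktop

    def dump_tree(element, depth=0, max_depth=3):
        # element_info carries the UIA control type (role) and name;
        # reads like this can be noticeably slow on complex apps
        info = element.element_info
        print("  " * depth + f"{info.control_type}: {info.name!r}")
        if depth < max_depth:
            for child in element.children():
                dump_tree(child, depth + 1)

    # enumerate top-level windows through the UIA backend
    for window in Desktop(backend="uia").windows():
        dump_tree(window)
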
  • Our experience working with a11y APIs like the above is that data is frequently missing, and the APIs can be shockingly slow to read from. The highest-performing agents in Windows Agent Arena use a mixture of a11y and YOLO-like grounding models such as OmniParser, with a11y seemingly shifting out of vogue in favor of computer vision because it gives incomplete context on its own.

    Talking with users who just write their own RPA, the most-loved API for doing so was consistently https://github.com/asweigart/pyautogui. There are a11y APIs available for this too, but they're messy enough that many of the teams I talked to relied on the pyautogui.locateOnScreen('button.png') fuzzy image-matching feature instead (see the sketch after this list).

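To illustrate that hybrid pattern, here's a rough sketch that tries an a11y lookup first and falls back to pyautogui's fuzzy image match; the window title, element name, and button.png are placeholders, and confidence= requires the optional opencv-python dependency:

    import pyautogui
    from pywinauto import Desktop

    def click_save_button():
        # 1) a11y route: find a UIA element by its accessible name
        try:
            window = Desktop(backend="uia").window(title_re=".*Editor.*")
            window.child_window(title="Save", control_type="Button").click_input()
            return
        except Exception:
            pass  # missing or slow a11y data, as described above

        # 2) vision route: fuzzy template match on a screenshot crop
        try:
            box = pyautogui.locateOnScreen("button.png", confidence=0.9)
            if box is None:  # older pyautogui returns None on no match
                return
            point = pyautogui.center(box)
            pyautogui.click(point.x, point.y)
        except pyautogui.ImageNotFoundException:  # newer versions raise
            pass
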
Another currently unanswered question in Muscle Mem is how to more cleanly express the targeting of named entities.

Currently, a user would have to explicitly make their @engine.tool function take an element ID as an argument, e.g. a click_element_by_name(id), in order for the action to be reused. This works, but it would lead to the codebase getting littered with hyper-specific functions that exist just to differentiate tools for Muscle Mem, which goes against the agent-agnostic thesis of the project.
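
Concretely, the workaround looks something like this (a sketch built around the @engine.tool decorator mentioned above; the Engine import and the elided resolution logic are assumptions about the setup, not Muscle Mem's actual internals):

    from muscle_mem import Engine

    engine = Engine()

    @engine.tool
    def click_element_by_name(id: str):
        # Resolve `id` against the live a11y tree and click the match
        # (logic elided). This function exists mainly so Muscle Mem can
        # capture `id` as a replayable parameter, which is exactly the
        # kind of hyper-specific tool described above.
        ...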

Still figuring out how to do this.