Comment by kh9000

5 months ago

Using the UIA tree as the currency for LLMs to reason over always made more sense to me than computer vision, screenshot based approaches. It’s true that not all software exposes itself correctly via UIA, but almost all the important stuff does. VS code is one notable exception (but you can turn on accessibility support in the settings)

10 comments

kh9000

philipbjorge 5 months ago

Important is subjective — In the healthcare space, I’d make the claim that most applications don’t expose themselves correctly (native or web).

CV and direct mouse/kb interactions are the “base” interface, so if you solve this problem, you unlock just about every automation usecase.

(I agree that if you can get good, unambiguous, actionable context from accessibility/automation trees, that’s going to be superior)

phatskat 5 months ago

I’ve been working hard on our new component implementation (Vue/TS) to include accessibility for components that aren’t just native reskins, like combo and list boxes, and keyboard interactivity is a real pain. One of my engineers had it half-working on her dropdown and threw in the towel for MVP because there’s a lot of little state edge cases to watch out for.
Thankfully the spec as provided by MDN for minimal functionality is well spelled out and our company values meeting accessibility requirements, so we will revisit and flesh out what we’re missing.
Also I wanna give props (ha) to the Storybook team for bringing accessibility testing into their ecosystem as it really does help to have something checking against our implementations.

freedomben 5 months ago

Agreed. I've noticed ChatGPT when parsing screenshots writes out some Python code to parse it, and at least in the tests I've done (with things like, "what is the RGB value of the bullet points in the list" or similar) it ends up writing and rewriting the script five or so times and then gives up. I haven't tried others so I don't know if their approach is unique or not, but it definitely feels really fragile and slow to me

Juminuvi 5 months ago
I noticed something similar. I asked it extract a guid from an image and it wrote a python script to run ocr against it...and got it wrong. Prompting a bit more seemed to finally trigger it to use it's native image analysis but I'm not sure what the trick was.
- spacebacon 5 months ago
  
  Probably just ask it to use native image analysis versus writing code. I have done this before extracting usernames from screenshots.
- morkalork 5 months ago
  
  I've run into this with uploading audio and text files, have to yell at it to not write any code and use it's native abilities to do the job.

nikanj 5 months ago

Most Electron software doesn't follow accessibility guidelines and exposes nothing over UIA

akurilin 5 months ago

I recently tried using Qwen VL or Moondream to see if off-the-shelf they would be able to accurately detect most of the interesting UI elements on the screen, either in the browser or your average desktop app.

It was a somewhat naive attempt, but it didn't look like they performed well without perhaps much additional work. I wonder if there are models that do much better, maybe whatever OpenAI uses internally for operator, but I'm not clear how bulletproof that one is either.

These models weren't trained specifically for UI object detection and grounding, so, it's plausible that if they were trained on just UI long enough, they would actually be quite good. Curious if others have insight into this.