Comment by arcb

15 hours ago

It's a great repo! We had issues with iframes and framesets (old DOM tags), which we had to write custom code for. Some DOMs need annotation to give an LLM meaning: for example, a button is clearly an "add demographics" button to the human eye, but is ambiguous in the DOM (ul contains li...). Some bottlenecks in navigation required manual attention; we keep those to a minimum. I think the future is being able to progress from highly deterministic JS code to more agentic, LLM-driven decisions. One does need to be able to control this for performance, cost, and accuracy. And yes, we have some overlap with workflow-use's direction, but I hope more such OSS methods gain popularity! It'd simply mean we can go after higher-value and more complex clinical tasks!

Did you consider working around those using vision models instead of DOM parsing? Was cost/latency the bottleneck? It seems like the agentic future you describe would need more vision-based parsing.