Comment by MK_Dev

7 days ago

This is a pretty cool idea and implementation. Any more details on the tech stack you guys are using (besides `browser-use`)?

Thank you! We have a fork of browser-use that lets us hand-hold web agents, since we know our tasks are repetitive. We can cache expected paths and fire alerts if an agent goes off the rails. We'd love to contribute it back at some point; it's mainly a question of bandwidth.
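To make the cached-path idea concrete, here is a minimal sketch of one way it could work. It is not the fork's actual code: `PathGuard`, the step-signature strings, and the `max_deviations` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PathGuard:
    """Compares an agent's live steps against a cached, known-good path."""

    expected: list[str]              # step signatures cached from an earlier run (hypothetical format)
    max_deviations: int = 0          # off-path steps tolerated before alerting
    deviations: list[tuple[int, str, str]] = field(default_factory=list)
    cursor: int = field(default=0, init=False)

    def observe(self, step: str) -> bool:
        """Record one step; return True if it matches the cached path."""
        want = self.expected[self.cursor] if self.cursor < len(self.expected) else "<end>"
        index = self.cursor
        self.cursor += 1
        if step == want:
            return True
        # Keep (position, expected, actual) so an alert can say what went wrong.
        self.deviations.append((index, want, step))
        return False

    @property
    def off_the_rails(self) -> bool:
        """True once the run has drifted more than the tolerated amount."""
        return len(self.deviations) > self.max_deviations
```

A caller would feed each agent action into `observe` and page someone (or pause the run) once `off_the_rails` flips to true.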

We're evaluating Cua (https://www.ycombinator.com/companies/cua) to containerize our agents; I'm a fan so far. We're also putting Computer Use agents (from OAI and Anthropic) to the test. Many legacy ERPs don't run in the browser, and we have to meet them where they run. I think we're a few months away from things working reliably and efficiently.

We're evaluating several of the top models (both open and closed) for browser navigation (Claude's winning atm) and PDF extraction. Since we're performing repetitive tasks, the goal is to make our workflows RL-able. Being able to rely on OSS models will help a lot here.

We're building our own data sets and evaluations for many of the subtasks. We're using OpenAI's evals (https://github.com/openai/evals) as a framework to guide our own tooling.
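As a rough illustration of what a subtask evaluation might look like for PDF extraction, here is a small field-level scorer. It is a hand-rolled sketch, not the openai/evals API; `field_accuracy` and the field names in the usage note are hypothetical.

```python
def field_accuracy(predicted: dict[str, str], expected: dict[str, str]) -> float:
    """Fraction of expected fields whose predicted value matches after normalization."""
    if not expected:
        return 1.0

    def norm(value: str) -> str:
        # Collapse whitespace and ignore case so formatting noise is not scored as an error.
        return " ".join(value.split()).casefold()

    hits = sum(1 for key, value in expected.items()
               if norm(predicted.get(key, "")) == norm(value))
    return hits / len(expected)
```

Scoring a labeled set is then a matter of averaging `field_accuracy` over documents, for example comparing `{"name": "Jane Doe"}` from a model against the hand-labeled record.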

Apart from that, we write in TypeScript, Python, and Golang. We use Postgres for persistence (nothing fancy here). We host on AWS, and might go on premises for some customers. We plan on investing a lot into our workflow system as the backbone of our product.

I prefer open source when possible. Everything's new and early, and many things require source changes that others might not be able to prioritize.

Edit - one thing I'd love to find a good solution for is reliably extracting handwriting from PDF documents. Clinicians have to do this a ton to keep the trains running on time, and being able to digitize that knowledge on the go will be huge.

Very open to ideas here. We're seeing great tools and products come up by the day, including from our own YC batch.

  • What made you fork browser-use? What were the missing bits? Your use case sounds similar to what they're trying with their new workflow-use repo (I am not affiliated with them, just curious).

    • It's a great repo! We had issues with iframes and framesets (which are old DOM tags) that we had to write custom code for. Some DOMs need annotation to provide meaning to an LLM: for example, a button is clearly an "add demographics" button to the human eye, but is ambiguous in the DOM (ul contains li...). Some bottlenecks in navigation required manual attention; we keep those to a minimum. I think the future is being able to progress from highly deterministic JS code to more agentic, LLM-driven decisions. One does need to be able to control this for performance, cost, and accuracy. And yes, we have some overlap with workflow-use's direction, but I hope that more such OSS methods gain popularity! It'd simply mean we can go after higher-value and more complex clinical tasks!
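The annotation and deterministic-first ideas in the reply above can be sketched roughly as follows. This is an illustration under assumptions, not the fork's implementation: `annotate`, `decide`, the element dictionaries, and the example selector are all hypothetical.

```python
from typing import Callable, Optional

Element = dict[str, str]
Decider = Callable[[list[Element]], Optional[str]]


def annotate(elements: list[Element], notes: dict[str, str]) -> list[Element]:
    """Attach a human-written meaning to elements whose selector has a note."""
    return [{**el, "meaning": notes[el["selector"]]} if el["selector"] in notes else el
            for el in elements]


def decide(elements: list[Element], rules: list[Decider], agent: Decider) -> tuple[str, str]:
    """Try cheap deterministic rules in order; fall back to the agent only if none match."""
    for rule in rules:
        action = rule(elements)
        if action is not None:
            return ("deterministic", action)
    action = agent(elements)  # stand-in for an LLM-driven decision
    if action is None:
        raise RuntimeError("needs manual attention: no rule or agent produced an action")
    return ("agent", action)
```

Returning which tier made the decision makes it possible to track how often the more expensive agent path is taken, which is one way to keep an eye on performance, cost, and accuracy.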