
Comment by fbouvier

9 days ago

Hey Anirudh, Stagehand looks awesome, congrats. Really love the focus on making browser automations more resilient to DOM changes. The act, extract, and observe methods are super clean.

You might want to check out Lightpanda (https://github.com/lightpanda-io/browser). It's an open-source, lightweight headless browser built from scratch for AI and web automation. It's focused on skipping graphical rendering to make it faster and lighter than Chrome headless.

I don't really follow: a lot of the fragility of web automation comes from the programmatic vs. visual differences, which VLMs are able to overcome. Skipping the graphical rendering seems to be committing yourself to non-visual hell.

The web isn't made for agents and automation. It's made for people.

  • Yes and no. Getting a VLM to work on the web would definitely be great, but it comes with its own problems, mainly around producing and acting on bounding boxes. We have vision as a default fallback for Stagehand, but we've found that the screenshot sent to the VLM often needs pre-labeled elements on it. The catch is that a screenshot with everything pre-labeled becomes too cluttered to process, while not pre-labeling risks missing important elements. I imagine a happy medium where the DOM + a11y tree is used to generate candidates for a VLM.

    Depending solely on a VLM is indeed reminiscent of how humans interact with the web, but when a model thrives on more data, why restrict the data sent to the model?
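
    The "happy medium" above could look roughly like the sketch below: filter the a11y tree down to a short list of numbered candidates, then hand only those to the VLM alongside the screenshot. This is an illustration under stated assumptions, not Stagehand's actual implementation. The `A11yNode` shape, the list of interactive roles, and the `generateCandidates` and `buildPrompt` helpers are all hypothetical, and the VLM call itself is left out.

```typescript
// Hypothetical shape for a flattened accessibility-tree node (not Stagehand's types).
interface A11yNode {
  role: string;
  name: string;
  // Bounding box in page pixels, assumed to come from DOM layout.
  box: { x: number; y: number; width: number; height: number };
}

interface Candidate extends A11yNode {
  label: number; // the number drawn on the screenshot and listed in the prompt
}

// Roles treated as interactive in this sketch; the list is an assumption.
const INTERACTIVE_ROLES = new Set(["button", "link", "textbox", "checkbox", "combobox"]);

// Keep only named, visible, interactive nodes and number them from 1.
function generateCandidates(nodes: A11yNode[], maxCandidates = 20): Candidate[] {
  return nodes
    .filter((n) => INTERACTIVE_ROLES.has(n.role))
    .filter((n) => n.name.trim().length > 0)
    .filter((n) => n.box.width > 0 && n.box.height > 0)
    .slice(0, maxCandidates)
    .map((n, i) => ({ ...n, label: i + 1 }));
}

// Build the text half of the VLM request; the screenshot would carry the same labels.
function buildPrompt(instruction: string, candidates: Candidate[]): string {
  const lines = candidates.map(
    (c) => `[${c.label}] ${c.role} "${c.name}" at (${c.box.x}, ${c.box.y})`
  );
  return [
    `Instruction: ${instruction}`,
    "Candidates:",
    ...lines,
    "Answer with one label number.",
  ].join("\n");
}

// Example input: one unnamed container and one zero-width (hidden) link get dropped.
const nodes: A11yNode[] = [
  { role: "button", name: "Sign in", box: { x: 10, y: 10, width: 80, height: 30 } },
  { role: "generic", name: "", box: { x: 0, y: 0, width: 800, height: 600 } },
  { role: "link", name: "Docs", box: { x: 100, y: 10, width: 0, height: 0 } },
  { role: "textbox", name: "Email", box: { x: 10, y: 60, width: 200, height: 30 } },
];

const candidates = generateCandidates(nodes);
const prompt = buildPrompt("log in to the site", candidates);
console.log(prompt);
```

    Only the surviving candidates would be labeled on the screenshot, which keeps the image readable while the DOM and a11y data make sure important elements are not missed.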

Lightpanda does look promising, but this is an important note from the README: "You should expect most websites to fail or crash."

  • You're absolutely right. The 'most websites will fail' note is there because we're still in development, and the browser doesn't yet handle the long tail of web APIs.

    That said, the architecture's coming together and the performance gains we're seeing make us excited about what's possible as we keep building. Feedback is very welcome, especially on what APIs you'd like to see us prioritize for specific workflows and use cases.