Comment by mertunsall
13 hours ago
In browser-use, we combine vision + browser extraction and we find that this gives the most reliable agent: https://github.com/browser-use/browser-use :)
We recently gave the model access to a file system so that it never forgets what it's supposed to do - we already have a ton of users who are very happy with the recent reliability updates!
We also have a beta called workflow-use, which is essentially what other comments here describe: a way to "cache" a workflow: https://github.com/browser-use/workflow-use
Let us know what you guys think - we are shipping hard and fast!
So there’s a big difference between the kind of vision approach browser-use takes and what we do.
browser-use is still strongly coupled to the DOM for interaction because of the set-of-marks approach it uses (for context: those little rainbow boxes you see around the elements). This makes it very difficult to reliably perform interactions beyond straightforward click/type, such as drag and drop, interacting with a canvas, etc.
Since we interact based purely on what we see on the screen using pixel coordinates, those sorts of interactions are a lot more natural for us and perform much more reliably. If you don't believe me, I encourage you to try to get both Magnitude and browser-use to drag and drop cards on a Kanban board :)
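To make the distinction concrete, here's a minimal sketch of the two action models being contrasted. All names here are illustrative, not the actual browser-use or Magnitude APIs: a DOM-labeled action can only name an element the page exposes, while a pixel-coordinate action lowers directly to raw mouse events, which is why a drag gesture falls out naturally.

```python
from dataclasses import dataclass

@dataclass
class DomClick:
    # Set-of-marks style: the agent picks a numbered element label,
    # so only actions the DOM exposes (click, type) map cleanly.
    element_index: int

@dataclass
class PixelDrag:
    # Pixel-coordinate style: any mouse gesture can be expressed
    # directly, e.g. dragging a Kanban card between columns.
    x0: int  # press here
    y0: int
    x1: int  # release here
    y1: int

def to_mouse_events(action):
    """Lower a pixel-coordinate action to raw mouse events a driver could replay."""
    if isinstance(action, PixelDrag):
        return [
            ("down", action.x0, action.y0),
            ("move", action.x1, action.y1),
            ("up", action.x1, action.y1),
        ]
    # A DOM-labeled action can't be lowered without an element->coordinate map.
    raise TypeError("DOM-labeled actions need an element map, not raw events")

# Hypothetical drag of a Kanban card from one column to another.
drag = PixelDrag(x0=120, y0=340, x1=560, y1=340)
print(to_mouse_events(drag))
```

The point of the sketch is that `PixelDrag` needs no cooperation from the page's DOM at all, whereas `DomClick` is only as expressive as the element set the marks cover.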
Regardless, best of luck!