Comment by wonger_

4 days ago

What is the benefit of porting all those tools to extensions? Have you run into any other extension-based challenges besides lifecycles and race conditions?

Some benefits (without using `chrome.debugger` or the Chrome DevTools Protocol):

1. There are 3,500,000,000 instances of Chrome desktop being used. [0]

2. A Chrome Extension can be installed with a click from the Chrome Web Store.

3. It runs closer to the metal, so it is extremely fast.

4. It can run entirely on the user's machine.

5. It's just one user automating their own web-based workflows, which makes it harder for bot protections to stop. With a human in the loop, any hang-ups and snags can be solved by the human.

6. Chrome extensions now have a side panel that stays fixed in the window during navigation and tab switching. It is exactly like using the Cursor or VS Code side-panel copilots.

Some limitations:

1. Can't automate the ChatGPT console because it checks whether events were generated by the user agent by testing that the `isTrusted` property on event objects is true. (The bypass is using `chrome.debugger` and the ChromeExtensionDriver I created.)

2. Can't take full-page screen captures; however, it is possible to very quickly take visible screen captures of the viewport. Currently I scroll and stitch the images together if a full page screen is required. There are other APIs that allow this in a Chrome extension and can capture video and audio, but they require the user to click a button, so they aren't useful for computer-vision automation. (The bypass is once again using `chrome.debugger` and the ChromeExtensionDriver I created.)

3. The Chrome DevTools Protocol allows intercepting and rewriting scripts and web pages before they are evaluated. With Manifest V2 this was possible from an extension, but the ability was removed in Manifest V3, which we still hear about today with the adblock extensions.
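
To make limitation 1 concrete, here is a minimal sketch of why events dispatched from script fail an `isTrusted` check. `handleClick` is a hypothetical handler invented for illustration, not the site's actual code, and a plain `EventTarget` stands in for a DOM element so the snippet also runs outside a browser.

```typescript
// Hypothetical page-side handler: accepts only events generated by the
// user agent (real user input), as signalled by Event.isTrusted.
function handleClick(event: Event): string {
  return event.isTrusted ? "accepted" : "rejected";
}

const results: string[] = [];

// A bare EventTarget stands in for a button element here (assumption for
// illustration; in a page this would be a real DOM node).
const button = new EventTarget();
button.addEventListener("click", (event) => {
  results.push(handleClick(event));
});

// An event constructed and dispatched from script always has
// isTrusted === false, so the handler rejects it.
button.dispatchEvent(new Event("click"));

console.log(results);
```

A content script can only produce events of the second kind, which is why the check blocks it unless input is injected at the browser level, as with the `chrome.debugger` bypass mentioned above.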

I feel like, given these limitations, a popup dialog that directs the user to perform an action will work as long as the extension automates 98% of the user's workflows. Moreover, a lot of this automation should require explicit user acknowledgment before proceeding.

[0] https://www.demandsage.com/chrome-statistics/

  • > Currently I scroll and stitch the images together if a full page screen is required.

    Actually, I wish this was exposed as an alternative full-page screenshot method in CDP. The dev tools approach very frequently does not work with SPAs that lazy load/unload, etc.

  • Biggest drawback is the distribution medium: the Chrome Web Store has lots of limitations (Manifest V3), so the first point is moot.

    Installing untrusted extensions requires a leap of faith that most users won’t and shouldn’t have.

    Fortunately or unfortunately.
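
For the scroll-and-stitch approach quoted above, the planning step might look roughly like the sketch below. It only computes where to scroll and how much of each viewport capture to crop. `planCaptures` and `CaptureStep` are hypothetical names, and the sketch assumes the page height stays fixed while capturing, which lazy-loading SPAs can violate. In an extension, each step would be followed by a visible-viewport capture (for example with `chrome.tabs.captureVisibleTab`).

```typescript
// One capture step: where to scroll, and how many pixels at the top of the
// captured viewport overlap the previous capture and should be cropped.
interface CaptureStep {
  scrollY: number;
  cropTop: number;
}

// Hypothetical helper: plan the scroll positions needed to cover the page.
function planCaptures(pageHeight: number, viewportHeight: number): CaptureStep[] {
  if (pageHeight <= 0 || viewportHeight <= 0) {
    throw new RangeError("heights must be positive");
  }
  // The browser cannot scroll past this offset.
  const maxScroll = Math.max(0, pageHeight - viewportHeight);
  const steps: CaptureStep[] = [];
  for (let target = 0; target < pageHeight; target += viewportHeight) {
    // Clamp the final step; the clamped distance is the overlap to crop.
    const scrollY = Math.min(target, maxScroll);
    steps.push({ scrollY, cropTop: target - scrollY });
  }
  return steps;
}

// A 2500px page in a 1000px viewport needs three captures; the last one
// is scrolled to 1500 and its top 500px duplicates the previous capture.
const plan = planCaptures(2500, 1000);
console.log(plan);
```

Cropping the overlap on the last step, rather than scrolling by uneven amounts, keeps every earlier capture aligned to a full viewport.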

> What is the benefit of porting all those tools to extensions?

Personally, I have a browser extension running in my personal browser instance that my agent uses (with rate limits), basically to avoid all the CAPTCHAs and blocks. Everything else I've tried ultimately ends up getting blocked. I'm also doing some heavy caching, so most agent "browse" calls never even reach the internet, because they find and use content already stored locally.
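
As a rough illustration of the cache-first, rate-limited "browse" pattern described above (not the commenter's actual implementation), the sketch below decides whether a request is served from cache, allowed to hit the network, or told to wait. `BrowseGate`, `Decision`, and the one-second interval are assumptions made for the example, and callers are assumed to be sequential.

```typescript
// Outcome of asking the gate about a URL.
type Decision =
  | { kind: "cached"; body: string }
  | { kind: "fetch" }
  | { kind: "wait"; ms: number };

// Hypothetical gate: cache first, then enforce a minimum interval
// between network fetches.
class BrowseGate {
  private cache = new Map<string, string>();
  private lastFetchAt = Number.NEGATIVE_INFINITY;
  private minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  decide(url: string, nowMs: number): Decision {
    // Cache hits never count against the rate limit.
    const body = this.cache.get(url);
    if (body !== undefined) return { kind: "cached", body };

    const waitMs = this.lastFetchAt + this.minIntervalMs - nowMs;
    if (waitMs > 0) return { kind: "wait", ms: waitMs };

    this.lastFetchAt = nowMs;
    return { kind: "fetch" };
  }

  // Record a fetched page so later calls are served locally.
  store(url: string, body: string): void {
    this.cache.set(url, body);
  }
}

const gate = new BrowseGate(1000);
const first = gate.decide("https://example.com/a", 0); // network fetch
gate.store("https://example.com/a", "<html>a</html>");
const second = gate.decide("https://example.com/a", 10); // served from cache
const third = gate.decide("https://example.com/b", 400); // too soon, must wait
const fourth = gate.decide("https://example.com/b", 1000); // interval elapsed
console.log(first.kind, second.kind, third.kind, fourth.kind);
```

Passing the clock in as `nowMs` keeps the gate deterministic, which makes the policy easy to test separately from the actual fetching.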