Comment by dataviz1000

4 days ago

I made this comment yesterday but really applies to this conversation.

> In the past 3 weeks I ported Playwright to run completely inside a Chrome extension without Chrome DevTools Protocol (CDP) using purely DOM APIs and Chrome extension APIs, I ported a TypeScript port of Browser Use to run in a Chrome extension side panel using my port of Playwright, in 2 days I ported Selenium ChromeDriver to run inside a Chrome Extension using chrome.debugger APIs which I call ChromeExtensionDriver, and today I'm porting Stagehand to also run in a Chrome extension using the Playwright port. This is following using VSCode's core libraries in a Chrome extension and having them drive a Chrome extension instead of an electron app.

The most difficult part is managing the lifecycle of Windows, Pages, and Frames and handling race conditions, in the case of automating a user's browser, where, for example, the user switches to another tab or closes the tab.

Extensions are ok but they have limitations too, for example you cannot use extensions to automate other extensions.

We need the agent to be able to drive 1password, Privacy.com, etc. to request per-task credentials, change adblock settings, get 2fa codes, and more.

The holy grail really is CDP + control over browser launch flags + an extension bridge to get to the more ergonomic `chrome.*` APIs. We're also working on a custom Chromium fork.

  • Use an Electron app to spawn a child process to open a Chrome browser using the launch flags including `--remote-debugging-pipe` -- instead of exposing a websockets connection on port 9226 or something -- which, if coupled with `--user-data-dir=<path>`, will not show the security CDP bar warning at the top of the page as long as the user data directory is not the default user directory.

    1. Get all the things you want.

    2. Can create as many 'browser context' personas as you want

    3. Use the Electron app renderer for UI to manage profiles, proxies for each profile, automate making gmail accounts for each profile, ect.

    4. Forgot, it is very nice using the `--load-extension=/path/to/extension` flag to ship chrome extension files inside the Electron app bundle so that the launched browser will have a cool copilot side panel.

    > Extensions are ok but they have limitations too, for example you cannot use extensions to automate other extensions.

    5. If you know the extension ids it is easy to set up communication between the two. I already drive a Chrome extension using VSCode's core libraries and it would be a week or two of work to implement a light port of the VSCode host extension API but for a Chrome extension. Nonetheless, I'd rather have an Electron app to manage extensions the same way a VSCode does.

    • Yeah I started building this in my first week at the company haha: https://github.com/browser-use/desktop

      Shipping a whole electron app is not a priority at the moment though, our revenue comes from cloud API users, and there we only need our custom chrome fork, no point messing with electron and extension bridges when we can add custom CDP commands to talk to `chrome.*` APIs directly.

      3 replies →

    • Yes an electron app helps tremendously, especially for managing lifecycle of tabs independently. We use that for creating our AI browser automations at Donobu (https://donobu.com). However, we do have the luxury of just focusing on a narrow AI QA use case vs. Browser-Use and others who need to support broad usecases in potentially adversarial environments.

What is the benefit of porting all those tools to extensions? Have you ran into any other extension-based challenges besides lifecycles and race conditions?

  • Some benefits (without using Chrome.debugger or Chrome DevTools Protocol):

    1. There are 3,500,000,000 instances of Chrome desktop being used. [0]

    2. A Chrome Extension can be installed with a click from the Chrome Web Store.

    3. It is closer to the metal so runs extremely fast.

    4. Can run completely contained on the users machine

    5. It's just one user automating their web based workflows making it harder for bot protections to stop and with a human-in-the-loop any hang ups and snags can be solved by the human

    6. Chrome extensions now have a side panel that is stationary in the window during navigation and tab switching. It is exactly like using the Cursor or VSCode side panel copilots

    Some limitations:

    1. Can't automate ChatGPT console because they check for user agent events by testing if the `isTrusted` property on event objects is true. (The bypass is using Chrome.debugger and the ChromeExtensionDriver I created.)

    2. Can't take full page screen captions however it is possible to very quickly take visible scree captions of the viewport. Currently I scroll and stitch the images together if a full page screen is required. There are other APIs which allow this in a Chrome Extension and can capture video and audio but they require the user to click on some button so it isn't useful for computer vision automation. (The bypass is once again using the Chrome.debugger and ChromeExtensionDriver I created.)

    3. Chrome DevTool Protocol allows intercepting and rewriting scripts and web pages before they are evaluated. With manifest v2 this was possible but they removed this ability in manifest v3 which we still hear about today with the adblock extensions.

    I feel like with the limitations having a popup dialog that directs the user to do an action will work as long as it automates 98% of the user's workflows. Moreover, a lot of this automation should require explicit user acknowledgments before preceding.

    [0] https://www.demandsage.com/chrome-statistics/

    • > Currently I scroll and stitch the images together if a full page screen is required.

      Actually, I wish this was exposed as an alternative full-page screenshot method in CDP. The dev tools approach very frequently does not work with SPAs that lazy load/unload, etc.

    • Biggest drawback is the distribution medium: Chrome web store has lots of limitations (manifest v3) so the first point is moot.

      Installing untrusted extensions requires a leap of faith that most users won’t and shouldn’t have.

      Fortunately or unfortunately.

  • > What is the benefit of porting all those tools to extensions?

    Personally, I have a browser extension running in my user/personal browser instance that my agent use (with rate-limits) in order to avoid all the captchas and blocks basically. Everything else I've tried ultimately ends up getting blocked. But then I'm also doing some heavy caching so most agent "browse" calls end up not even reaching out to the internet as it's finding and using stuff already stored locally.

is this open source ? just curious to see this. sounds fascinating!

  • What I have so far needs a lot of work and is flaky. Everyday it is getting tighter and better.

    Microsoft pulled out the lifecycle management code from Puppeteer and put it into Playwright with Google's copyright still at the top of the several files. They both use CDP. I'm using the Chrome extension analogue for every CDP message and listener. I need a couple days to remove all the code from the Page, Frame, FrameManager, and Browser classes and methodically make a state machine with it to track lifecycle and race conditions. It is a huge task and I don't want to share it without accomplishing that.

    For example, there is a system that listens for all navigation requests in a Page's / Tab's Frames in Playwright. Embedded frames can navigate to urls which the parent Frame is still loading such as advertising resources, all that needs to be tracked.

    There are a lot of companies that are talking about building solutions using CDP without Playwright and I'm curious how well they are going to handle the lifecycle management. Maybe if they don't intercept requests and responses it is very straight forward and simple.

    One idea I have is just evaluate '1+1' in the frame's content script in a loop with a backoff strategy and if it returns 2 then continue with code execution or if it times out fail instead of tracking hundreds of navigations with with 30 different embedded frames in a page like CNN. I'm still tinkering. Stagehand calls Locator.evaluate() which is what I'm building because I haven't implemented it yet.

    • Yes the key is we don't intercept requests and responses, that saves 60% of the headache of lifecycle management.

      We do exactly what you described with a 1+1 check in a loop for every target, it pops any crashed sessions from the pool, and we don't keep any state beyond that about what tabs are alive. We really try to derive everything fresh from the browser on every call, with minimal in-memory state.

      https://github.com/browser-use/browser-use/blob/2a0f4bd93a43...

      1 reply →

    • very cool! software never gets done ....so would have loved to see it (and contributed to it). But totally respect that. Would love to see it when ur done!

      for context, i contribute to an opensource mobile browser (built on top of chromium), where im actually building out the extension system. Chrome doesnt have extensions on mobile! would have loved to see if this works on android...with android's lifecycle on top of your own !!!

      1 reply →

Yes, that is the most difficult part. But none of the frameworks adequately handle that and that was a major reason I’ve used CDP since the beginning.

From day one with BrowserBox, we have been using CDP, unadulterated by any higher abstractions. Despite the apparent risk that terrifying changes in the tip-of-tree protocol would lead to disastrous code migrations, none of that ever occurred. The most rewritten code in the application is consistently user interface and core features.

Over the nearly 8 years of BrowserBox’s existence, CDP-related changes due to domain and method deprecations, or subtle changes in parameters or behavior, have been only a very minor maintenance burden. A similar parallel could probably be drawn by examining the Chrome DevTools front-end, another gold-standard CDP-based application, and even digging into its commit history to see how often changes regarding CDP were actually due to protocol-breaking changes.

That was my sense when I began this project: that the protocol is not going to change that much, and we can handle it. My other reason for not choosing Puppeteer or Playwright was that I was dissatisfied with the abstractions they imposed atop CDP, and I found them insufficiently expressive or flexible for the actual demanding use cases of virtualizing a browser in all its aspects — including multiple tabs, managing and bookkeeping all of that state required to do that.

The CDP protocol is still the gold standard for browser instrumentation. It would be nice if Firefox had not deprecated support, and it would be even nicer if WebDriver BiDi was a sufficient and adequate replacement for CDP, which for now it is not. The behavior, logic, and abstractions of CDP are well thought out and highly appropriate for its problem domain. It’s like separating a browser’s engine from its user interface, which is one of the core things BrowserBox accomplishes.

Working with CDP is apparently “difficult,” but that’s just another myth. It’s incredibly easy to write a hundred-or-so-line promise-resolving logic library to ensure you get responses. I’ve done this, and it works. I have used CDP alone in two major, thousands-of-stars, thousands-of-users, significant browser-related projects (the other is DiskerNet), and I have never regretted that choice, nor ever wished that I had switched to Puppeteer or Playwright.

That said, I think the sweet spot for Puppeteer and Playwright is quickly putting together not-overly-complex automation tasks, or other specific browser-instrumentation-related tasks with a fairly narrow scope. The main reason I used CDP was because I wanted the power of access to the full protocol, and I knew that would be the best choice — and it was.

So if your browser-related project is going to require deep integration with the browser and access to everything that’s exposed, don’t even think twice about using CDP. Just use it. The only caveat I would make to that is: keep an eye on whether WebDriver BiDi capabilities become sufficient for your use case, and seriously consider a WebDriver BiDi implementation, because that gives you a broader swath of browsers you’ll be able to use as the engine.

[1] https://chromedevtools.github.io/devtools-protocol/

[2] https://github.com/ChromeDevTools/devtools-frontend

[3] https://w3c.github.io/webdriver-bidi/

[4] https://pptr.dev/

[5] https://playwright.dev/