Comment by Klaster_1

9 days ago

It's funny you mentioned "deterministic Playwright code," because in my experience, that’s one of the most frustrating challenges of writing integration tests with browser automation tools. Authoring tests is relatively easy, but creating reliable, deterministic tests is much harder.

Most of my test failures come down to timing issues—CPU load subtly affects execution, leading to random timeouts. This makes it difficult to run tests both quickly and consistently. While proactive load-testing of the test environment and introducing artificial random delays during test authoring can help, these steps often end up taking more time than writing the tests themselves.

It would be amazing if tools were smart enough to detect these false positives automatically. After all, if a human can spot them, shouldn’t AI be able to as well?

Over the holidays I was working on a side project built on (I think) the same idea mpalmer described, though my project wouldn't be of interest to him either, because my goal wasn't automating tests.

Basically, the goal would be to handle it like screenshot regression tests, with two separate execution phases:

- generate
- verify

And when verify fails in CI, you can automatically run a generate and open an MR/PR with the new script.

This lets you audit the script and do a plausibility check. You get notified of changes, but keeping the tests running takes minimal effort.
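A minimal sketch of that generate/verify loop, assuming the script is stored as a JSON list of step names. The helpers `produce_script` and `run_step` are hypothetical stand-ins for whatever would author the steps and replay them in a browser; none of this is taken from an actual project.

```python
import json
from pathlib import Path
from typing import Callable, List


def generate(produce_script: Callable[[], List[str]], path: Path) -> List[str]:
    """Generate phase: produce a fresh script and store it so it can be reviewed."""
    steps = produce_script()
    path.write_text(json.dumps(steps, indent=2))
    return steps


def verify(run_step: Callable[[str], bool], path: Path) -> bool:
    """Verify phase: replay the stored script; succeed only if every step succeeds."""
    steps = json.loads(path.read_text())
    return all(run_step(step) for step in steps)


def ci_check(
    produce_script: Callable[[], List[str]],
    run_step: Callable[[str], bool],
    path: Path,
) -> str:
    """Run verify; if it fails (or no script exists yet), fall back to generate."""
    if path.exists() and verify(run_step, path):
        return "pass"
    generate(produce_script, path)
    # Assumption: a real CI job would open an MR/PR with the changed file here.
    return "regenerated"
```

The regenerated file shows up as an ordinary diff, which is what would make the audit step cheap.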

  • This is super interesting, is it open source? Would love to talk to you more about how this worked

    • It's not at a stage where I'd be comfortable putting it on GitHub yet, maybe in a few months.

      And I think you misunderstood my comment: I didn't describe my project, but extrapolated from the parent's wish and my own motivations for my project.

      Mine is actually pretty close to stagehand, at least I could very well use it. It's basically a web UI to configure browser tasks like "open webpage x" or "iterate over item type", with LLM integration to determine what the CSS selector for that would be. On the next execution it would try the previously determined CSS selector instead of calling the LLM again. On failures, it'd raise a notification with an admin task to verify new selectors or fix the script.

      But it's a lot of code to put together as a generic UI, since I want these tasks to be repeatable without restarting from the beginning, etc.

      Still very much in the PoC stage without any tests, barely working persistence etc
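
A rough sketch of the selector caching described above, not the actual project code. `SelectorResolver`, `ask_llm`, and `selector_works` are hypothetical names: `ask_llm` stands in for the LLM call that proposes a CSS selector, and `selector_works` for a check against the live page. Returning the new candidate immediately after a failure is also an assumption; a stricter version could wait for the admin to approve it first.

```python
from typing import Callable, Dict, List


class SelectorResolver:
    """Reuse a previously determined selector; ask the LLM only when needed."""

    def __init__(
        self,
        ask_llm: Callable[[str], str],
        selector_works: Callable[[str], bool],
    ) -> None:
        self.ask_llm = ask_llm
        self.selector_works = selector_works
        self.cache: Dict[str, str] = {}
        self.notifications: List[str] = []

    def resolve(self, item_type: str) -> str:
        cached = self.cache.get(item_type)
        if cached is not None and self.selector_works(cached):
            # Fast path: the stored selector still matches, no LLM call.
            return cached
        candidate = self.ask_llm(item_type)
        if cached is not None:
            # The stored selector broke: queue an admin task to verify the new one.
            self.notifications.append(
                f"selector for {item_type!r} changed: {cached} -> {candidate}"
            )
        self.cache[item_type] = candidate
        return candidate
```

The cache here lives in memory only; repeatable tasks would need it persisted between runs.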