Comment by adenta

8 months ago

I wonder if a nice middle ground would be:

- recording the Playwright script behind the scenes and storing it
- trying that as a “happy path” first attempt to see if it passes
- if it doesn’t pass, rebuilding it with the AI and vision models

Best of both worlds. The Playwright script is more of a cache than a test.
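
Roughly what I have in mind, as a minimal sketch rather than a real implementation. Everything here is hypothetical: `ScriptCache`, `run_task`, and `rebuild_with_ai` are made-up names, and a "script" is just any callable that replays recorded steps and raises when something goes wrong (in practice that would wrap a recorded Playwright run).

```python
from typing import Callable, Dict, Optional

# A "script" is any callable that replays recorded steps and raises on failure.
# In a real setup this would wrap a recorded Playwright run (assumption).
Script = Callable[[], None]


class ScriptCache:
    """Hypothetical in-memory store mapping a task description to a recorded script."""

    def __init__(self) -> None:
        self._scripts: Dict[str, Script] = {}

    def get(self, task: str) -> Optional[Script]:
        return self._scripts.get(task)

    def put(self, task: str, script: Script) -> None:
        self._scripts[task] = script


def run_task(
    task: str,
    cache: ScriptCache,
    rebuild_with_ai: Callable[[str], Script],
) -> str:
    """Try the cached script first; on a miss or a failure, rebuild and re-record.

    Returns "cache" if the recorded script passed, "ai" if the fallback ran.
    """
    cached = cache.get(task)
    if cached is not None:
        try:
            cached()  # happy path: replay the recording
            return "cache"
        except Exception:
            pass  # recording is stale; fall through to the slow path

    fresh = rebuild_with_ai(task)  # slow path: the AI / vision models redo the task
    fresh()
    cache.put(task, fresh)  # overwrite the stale entry with the new recording
    return "ai"
```

The first run of a task pays the AI cost, later runs replay the recording, and a broken recording gets replaced the next time it fails.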


anerli 8 months ago

I think the difficulty with this approach is twofold. (1) You want a good "lookup" mechanism: given a task, how do you know which cache entry should be loaded? You can do a simple string lookup based on the task content, but when the task includes parameters or data, or is part of a bigger workflow, it gets trickier. (2) You need a good way to detect when to adapt or fall back to the LLM. When the cache is only a Playwright script, it can be difficult to know when it has fallen off the existing trajectory. You can check for selector timeouts and the like, but you will still miss a lot of false negatives, where the script runs without errors yet has drifted from the intended path.
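
To make both points concrete, here is a rough sketch under some assumptions, not a description of how any real tool does it. For (1) it keys the cache on the task template rather than the filled-in string, and for (2) it stores an explicit postcondition with each recording, so a replay that finishes without errors can still be flagged. `cache_key`, `CachedStep`, and `replay_or_flag` are hypothetical names, and the "observed state" is just a string standing in for whatever you would read off the page.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


def cache_key(template: str) -> str:
    """Key on the task template (placeholders left unfilled), with
    whitespace collapsed and case folded, so parameters don't split the cache."""
    return re.sub(r"\s+", " ", template).strip().lower()


@dataclass
class CachedStep:
    # Replays the recorded steps with the given parameters and returns the
    # observed end state (here a plain string, as a stand-in for page content).
    replay: Callable[[Dict[str, str]], str]
    # Decides whether the observed state is what the task was supposed to reach.
    postcondition: Callable[[str, Dict[str, str]], bool]


def replay_or_flag(step: CachedStep, params: Dict[str, str]) -> bool:
    """Return True only if the replay ran AND its postcondition held.

    False means "fall back to the LLM". This catches both loud failures
    (exceptions such as selector timeouts) and quiet drift.
    """
    try:
        observed = step.replay(params)
    except Exception:
        return False  # loud failure: timeout, missing selector, etc.
    return step.postcondition(observed, params)  # quiet failure check
```

The postcondition is the part that handles the quiet case: a replay that ends on a cookie banner instead of the results page raises nothing, but it fails the check. Writing good postconditions for arbitrary tasks is still the hard part, and this sketch does not solve that.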