Comment by plufz

18 hours ago

But we can use an LLM to write that script, and give that agent access to a browser to find DOM selectors etc. And then we have a stable script where we can, if needed, manually fix any LLM bugs just once…? I’m sure there are use cases with messy selectors as you say, but to me it feels like most cases are better covered by generating scripts.

Yeah we've thought about this approach a lot - but the problem is that if your final program is a brittle script, you're gonna need a way to fix it again often - and then you're still depending on recurrently using LLMs/agents. So we think it's better to have the program itself be resilient to change instead of you/your LLM assistant having to constantly ensure the program is working.

  • I wonder if a nice middle ground would be: (1) recording the Playwright script behind the scenes and storing it, (2) trying that as a “happy path” first attempt to see if it passes, and (3) if it doesn’t pass, rebuilding it with the AI and vision models.

    Best of both worlds. The Playwright script is more of a cache than a test.

    • I think the difficulty with this approach is (1) you want a good "lookup" mechanism - given a task, how do you know which cache should be loaded? You can do a simple string lookup based on the task content, but when the task might include parameters or data, or be part of a bigger workflow, it gets trickier. (2) You need a good way to detect when to adapt / fall back to the LLM. When the cache is only a Playwright script, it can be difficult to know when it falls out of the existing trajectory. You can check for selector timeouts and things, but you might be missing a lot of false negatives.
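
For illustration, here is a rough sketch of the "script as cache" idea with the two problems above made explicit. Everything in it is an assumption for the sake of the example: `task_key`, `ScriptCache`, `regenerate` and `verify` are hypothetical names, quoted strings stand in for task parameters, and a plain Python callable stands in for a recorded Playwright script. It is not how any particular framework does it.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# A "script" is any callable that takes task parameters and returns a result.
# In practice this would wrap a recorded browser script (assumption).
Script = Callable[[Dict[str, str]], str]


def task_key(task: str) -> Tuple[str, Dict[str, str]]:
    """Split a task into a parameter-free template (the cache key) and its parameters.

    Assumption: parameters are the double-quoted strings in the task text.
    """
    params: Dict[str, str] = {}

    def repl(match: "re.Match[str]") -> str:
        name = f"p{len(params)}"
        params[name] = match.group(1)
        return "{" + name + "}"

    template = re.sub(r'"([^"]*)"', repl, task)
    return template, params


class ScriptCache:
    def __init__(
        self,
        regenerate: Callable[[str], Script],  # hypothetical: asks the LLM agent for a new script
        verify: Callable[[str, str], bool],   # hypothetical: checks the result looks right
    ) -> None:
        self._regenerate = regenerate
        self._verify = verify
        self._scripts: Dict[str, Script] = {}

    def run(self, task: str) -> Tuple[str, str]:
        """Try the cached script first; fall back to regeneration on error or failed check."""
        template, params = task_key(task)
        script: Optional[Script] = self._scripts.get(template)
        if script is not None:
            try:
                result = script(params)
                # A timeout is the easy case; verify() is there for silent failures.
                if self._verify(task, result):
                    return result, "cache"
            except Exception:
                pass  # e.g. a selector timeout: treat as a cache miss
        script = self._regenerate(template)
        self._scripts[template] = script
        return script(params), "regenerated"
```

The hard part is still `verify`: a selector timeout raises and is easy to catch, but a script that runs to completion on a changed page and returns the wrong thing only gets caught if that check is good.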

  • Are you sure? Couldn't you just go back to the LLM if the script breaks? Pages change, but not that often in general.

    It seems like a hybrid approach would scale better and be significantly cheaper.

    • We do believe in a hybrid approach where a fast/deterministic representation is saved - but think there is a more seamless way where the framework itself is high-level and manages these details by caching the underlying actions that can run