Comment by dangerlego5
8 days ago
The visual regression point is interesting. In my experience, the models that do best at "overlapping text/bad layout" catches are the ones being fed actual screenshots rather than DOM snapshots. If Fable is doing screenshot-based diffs natively, that would explain an improvement there, but I haven't verified it.
From how Simon described it, it's not a native feature but one that the model built as a solution for automated testing. You could already instruct the agent to write a program that saves screenshots to disk and then reads them back. As long as the model is multimodal (which pretty much all releases are these days), it can "natively" interpret images. There's probably a clever way to engineer this to be somewhat efficient, but for me it was rather token hungry, because the testing inputs and the description are usually quite verbose. I suppose you could use a weaker model for navigating the test and then only feed the output to the stronger model.
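To make that last idea concrete, here is a rough sketch of the split I have in mind, not anything Simon or Fable actually does. Every name in it (`run_visual_check`, `navigate`, `review`, `Finding`) is made up for illustration. `navigate` stands in for the cheaper model (or a plain script) that drives the page and writes a screenshot to disk, and `review` stands in for the stronger multimodal model, which only ever sees the image bytes and never the verbose test inputs.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List


@dataclass
class Finding:
    """One screenshot on disk plus whatever the reviewer said about it."""
    screenshot: Path
    verdict: str


def run_visual_check(
    steps: Iterable[str],
    navigate: Callable[[str, Path], Path],  # hypothetical: cheap model or script
    review: Callable[[bytes], str],         # hypothetical: stronger multimodal model
    out_dir: Path,
) -> List[Finding]:
    """Run each step with the cheap navigator, then show only the image to the reviewer."""
    out_dir.mkdir(parents=True, exist_ok=True)
    findings: List[Finding] = []
    for index, step in enumerate(steps):
        # The navigator performs the step and saves a screenshot at `target`.
        target = out_dir / f"step_{index:03d}.png"
        saved = navigate(step, target)
        # The reviewer gets the screenshot bytes only, not the step description.
        verdict = review(saved.read_bytes())
        findings.append(Finding(screenshot=saved, verdict=verdict))
    return findings
```

You would plug in whatever browser automation and model calls you already have for the two callables. The point is only that the expensive model's input stays limited to the screenshots, which is where I'd expect the token savings to come from, though I haven't measured it.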