Comment by gaflo

8 days ago

Thanks for documenting your personal observations. I do have a few questions. First, could you expand by giving other examples on how you observed this model to be relentlessly proactive? From my personal experience with prior frontier models using both Claude Code and Codex I found them to already be quite proactive depending on the domain (although Codex a bit less so, which I personally prefer). The main task that they seemed to struggle with for me are tasks that naturally have long run times for the programs the agents wrote, as they didn't seem to have a good intuition for when/how to change approach to minimise the time spent on the task. Specificically if you are trying to scrape sites/services that are heavily guarded against programmatic access or running automated tasks that call LLMs (such as indexing or document extraction). I'm not surprised that for web dev the proactiveness is the most obvious improvement, as I would expect the most common use case with the most training data to be the biggest priority. I have previously built a similar workflow as you described Fable 5 to auto test changes to the website and while it worked somewhat well, it often couldn't identify obvious flaws to the human eye, such as overlapping text or inconsistent font choices as well as bad layout decisions. I do like it for quick prototyping, but the testing and design decisions were not ones I would hand off at this moment. Did you notice improvements in these areas? Can you share how it does for long running programs?

If you want I can give you some more specific instructions to test, but I would also be happy to hear from your own use cases.

The visual regression point is interesting. In my experience, the models that do best at "overlapping text/bad layout" catches are the ones being fed actual screenshots rather than DOM snapshots. If Fable is doing screenshot-based diffs natively, that would explain an improvement there, but I haven't verified it.

  • From how Simon described it it's not a native feature, but one that the model built as a solution for automatically testing. You could already instruct the agent to write a program that saves screenshots to disk and then reads it. As long as the model is multimodal (which pretty much all releases are these days) it can "natively" interpret images. There's probably a clever way to engineer this to be somewhat efficient, but for me it was rather token hungry, because the testing inputs and the description are usually quite verbose. I suppose you could use a weaker model for navigating the test and then only feed the output to the stronger model.