Comment by technocrat8080

2 hours ago

5.4 Mini's OSWorld score is a pleasant surprise. When SOTA scores were still ~30-40, models were too slow and inaccurate for real-time computer-use agents (RIP Operator/Agent). Curious if anyone's been using these in production.

People seem to dismiss OSWorld as "OpenClaw," but I think they're missing how powerful and flexible that type of full interaction is for safe workflows.

We have a legacy Win32 application, and we want to compare interactions + responses side by side between it and the web-converted version of the same app. Once you've taught the model that "X = Y" between desktop and web, you've got yourself an automated test suite.

Is it possible to do this another way? Sure, but it isn't cost-effective as you scale the workload out to 30+ Win32 applications.
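
For anyone curious what the "X = Y" comparison could look like in practice, here's a rough sketch, not our actual setup. Everything in it is made up for illustration: `CONTROL_MAP`, `Step`, `compare`, and the two fake drivers are hypothetical names, and in a real run the drivers would be whatever computer-use agent or automation layer drives each UI.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical "X = Y" mapping: desktop control name -> web element selector.
CONTROL_MAP: Dict[str, str] = {
    "btnSave": "#save",
    "txtName": "#name",
}


@dataclass(frozen=True)
class Step:
    control: str      # desktop control name (key in CONTROL_MAP)
    action: str       # e.g. "click" or "type"
    value: str = ""   # optional input for the action


# A driver takes (target, action, value) and returns the response text it observed.
Driver = Callable[[str, str, str], str]


def normalize(text: str) -> str:
    """Collapse whitespace and case so cosmetic differences don't count."""
    return " ".join(text.split()).lower()


def compare(steps: List[Step], desktop: Driver, web: Driver) -> List[Tuple[Step, str, str]]:
    """Run each step on both apps and collect the steps whose responses differ."""
    mismatches = []
    for step in steps:
        d = desktop(step.control, step.action, step.value)
        w = web(CONTROL_MAP[step.control], step.action, step.value)
        if normalize(d) != normalize(w):
            mismatches.append((step, d, w))
    return mismatches


# Stand-in drivers (assumptions, not real automation backends).
def fake_desktop(target: str, action: str, value: str) -> str:
    return f"{action} ok: {value}"


def fake_web(target: str, action: str, value: str) -> str:
    # Deliberately diverges on "click" so the comparison reports a mismatch.
    if action == "click":
        return "click failed"
    return f"{action}  OK: {value}"


steps = [Step("txtName", "type", "Ada"), Step("btnSave", "click")]
mismatches = compare(steps, fake_desktop, fake_web)
for step, d, w in mismatches:
    print(f"{step.control}/{step.action}: desktop={d!r} web={w!r}")
```

The mapping is the only per-application piece in this sketch, which is roughly why the approach scales across many apps: the comparison loop stays the same and you add a new map per application.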