Comment by euphetar
11 hours ago
I wouldn't call it a benchmark since it's just one sample. They do highlight a real problem, though. Computer use is immature right now and far behind language agents.
Try playing Fruit Ninja via text and LLM tool calls, though.