Comment by flakiness
3 months ago
To be honest I'm surprised how well it holds. I expected close-to-total collapse. It'll be a matter of time I guess, but still.
I wonder if any of the agents hit the audio button and listened to the instructions? In my experience, that can be pretty helpful.
Same! As we talk about in the article, the failures came less from raw model intelligence/ability than from challenges with timing and dynamic interfaces.
I mean, did you see the cross-tile numbers?