Comment by flakiness

3 months ago

To be honest, I'm surprised how well it holds up. I expected close-to-total collapse. It's a matter of time, I guess, but still.

I wonder if any of the agents hit the audio button and listened to the instructions? In my experience, that can be pretty helpful.

Same! As we discuss in the article, the failures came less from raw model intelligence/ability than from challenges with timing and dynamic interfaces.