Comment by flakiness

3 months ago

To be honest, I'm surprised how well it holds up. I expected close-to-total collapse. It's only a matter of time, I guess, but still.

3 comments

criddell 3 months ago

I wonder if any of the agents hit the audio button and listened to the instructions? In my experience, that can be pretty helpful.

mdahardy 3 months ago

Same! As we discuss in the article, the failures came less from raw model intelligence/ability than from challenges with timing and dynamic interfaces.

swyx 3 months ago

I mean, did you see the cross-tile numbers?