Comment by zkmon
1 month ago
I think the failure is in reasoning about where the car is and whether it needs to be moved to a different place. So it's not surprising that only models with high reasoning would pass the test.