Comment by WhitneyLand 1 month ago
Opus 4.6 was getting this wrong only last week.

handoflixue 1 month ago
Oh wow, Sonnet still isn't handling it well:
Opus 4.6: Drive (https://claude.ai/share/d57fef01-df32-41f2-b1dc-07de7916bdc7)
Opus 4.5: Drive (https://claude.ai/chat/a590cac1-100a-490b-b0a2-df6676e1ae99)
Opus 3.0: Walk (https://claude.ai/chat/372c144c-d6eb-43f5-b7ea-fd4c51c681db)
Sonnet 4.6: Walk (https://claude.ai/share/1f2a80f3-4741-40a5-8a05-7349ea1a17e5)
Sonnet 4.5: Walk (https://claude.ai/share/905afeb6-ffc9-4b4b-a9ee-4481e5cfd527)
Favorite answer, using my default custom instructions: "Drive. Walking there means... leaving your car at home? Walk it there on a leash? Walk if you want the exercise, but you're bringing the car either way."

randomtoast 1 month ago
This is because thinking isn't enabled. Of course the results are disappointing.

handoflixue 1 month ago
It seems entirely fair to evaluate a product based on the baseline that the company itself offers.