Comment by WhitneyLand 3 months ago
Opus 4.6 was getting this wrong only last week.
handoflixue 3 months ago
Oh wow, Sonnet still isn't handling it well:
Opus 4.6: Drive (https://claude.ai/share/d57fef01-df32-41f2-b1dc-07de7916bdc7)
Opus 4.5: Drive (https://claude.ai/chat/a590cac1-100a-490b-b0a2-df6676e1ae99)
Opus 3.0: Walk (https://claude.ai/chat/372c144c-d6eb-43f5-b7ea-fd4c51c681db)
Sonnet 4.6: Walk (https://claude.ai/share/1f2a80f3-4741-40a5-8a05-7349ea1a17e5)
Sonnet 4.5: Walk (https://claude.ai/share/905afeb6-ffc9-4b4b-a9ee-4481e5cfd527)
Favorite answer, using my default custom instructions: "Drive. Walking there means... leaving your car at home? Walk it there on a leash? Walk if you want the exercise, but you're bringing the car either way."
randomtoast 3 months ago
This is because thinking was not enabled. Of course the results are disappointing.
handoflixue 3 months ago
It seems entirely fair to evaluate a product based on the baseline that the company itself offers.