Comment by stingraycharles 10 days ago That’s without reasoning I presume?
plexicle 10 days ago 4.6 Opus with extended thinking just now: "At 50 meters, just walk. By the time you start the car, back out, and park again, you'd already be there on foot. Plus you'll need to leave the car with them anyway."
gf000 10 days ago Not the parent poster, but I did get the wrong answer even with reasoning turned on.
tezza 10 days ago Thank you all! We needed further data points. Comparing one-shot results is a foolish way to evaluate a statistical process like LLM answers. We need multiple samples. For https://generative-ai.review I do at least three samples of output. This often yields very different results even from the same query. E.g.: https://generative-ai.review/2025/11/gpt-image-1-mini-vs-gpt...
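To make the "multiple samples" point concrete, here is a minimal Python sketch of tallying repeated answers to the same prompt. It is not how generative-ai.review does it; `sample_answers`, `agreement`, and `fake_model` are hypothetical names, and `fake_model` is only a random stand-in for whatever model call you actually use.

```python
import random
from collections import Counter
from typing import Callable


def sample_answers(ask: Callable[[str], str], prompt: str, n: int = 3) -> Counter:
    """Call `ask` n times on the same prompt and tally the distinct answers."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(ask(prompt) for _ in range(n))


def agreement(tally: Counter) -> float:
    """Fraction of samples that gave the most common answer."""
    total = sum(tally.values())
    return tally.most_common(1)[0][1] / total


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; answers at random."""
    return random.choice(["walk", "drive"])


if __name__ == "__main__":
    tally = sample_answers(fake_model, "Walk or drive 50 meters?", n=5)
    print(tally, agreement(tally))
```

A low agreement score across samples is a hint that any single one-shot answer, right or wrong, says little about the model.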