Comment by e1g
21 hours ago
Agree, current "thinking" models are effectively "re-run this question N times, then determine the best answer", and this LLM-evaluating-LLM loop demonstrably leads to higher-quality answers against objective metrics (math benchmarks, etc.).
That’s… not how thinking models work. They tend to be iterative and serial, not parallel-then-pick-one.
Parallel test-time compute is exactly what SOTA models do, including Claude 4 Opus extended, o3 Pro, Grok 4 Heavy, and Gemini 2.5 Pro.
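For what it's worth, the "sample N times, pick one" pattern being debated above can be sketched in a few lines. This is a toy illustration of majority-vote selection (self-consistency) over a fixed list of hypothetical candidate answers standing in for N independent model samples; real systems generate the candidates with parallel model calls and may use a separate grader model instead of a vote.

```python
from collections import Counter

# Hypothetical candidate answers, standing in for N independent model samples
# to the same question (in practice these come from parallel LLM calls).
CANDIDATES = ["42", "42", "41", "42", "43"]

def majority_vote(samples: list[str]) -> str:
    # Self-consistency selection: the answer produced most often wins.
    # Counter.most_common(1) returns [(answer, count)] for the top answer.
    return Counter(samples).most_common(1)[0][0]

print(majority_vote(CANDIDATES))  # prints 42
```

The selection step is the part that varies between systems: majority vote works when answers are short and comparable, while an LLM-as-judge scores free-form candidates instead.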