Comment by zone411
2 months ago
Scores on the extended version of NYT Connections (https://github.com/lechmazur/nyt-connections/):
Claude Opus 4 Thinking 16K: 52.7.
Claude Opus 4 No Reasoning: 34.8.
Claude Sonnet 4 Thinking 64K: 39.6.
Claude Sonnet 4 Thinking 16K: 41.4 (Sonnet 3.7 Thinking 16K was 33.6).
Claude Sonnet 4 No Reasoning: 25.7 (Sonnet 3.7 No Reasoning was 19.2).
Claude Sonnet 4 Thinking 64K refused to answer one puzzle, citing "Output blocked by content filtering policy." No other model refused.
On my Thematic Generalization Benchmark (https://github.com/lechmazur/generalization, 810 questions), the Claude 4 models are the new champions.