Comment by zone411

9 months ago

Scores on the extended version of NYT Connections (https://github.com/lechmazur/nyt-connections/):

Claude Opus 4 Thinking 16K: 52.7.

Claude Opus 4 No Reasoning: 34.8.

Claude Sonnet 4 Thinking 64K: 39.6.

Claude Sonnet 4 Thinking 16K: 41.4 (Sonnet 3.7 Thinking 16K was 33.6).

Claude Sonnet 4 No Reasoning: 25.7 (Sonnet 3.7 No Reasoning was 19.2).

Claude Sonnet 4 Thinking 64K refused to provide one puzzle answer, citing "Output blocked by content filtering policy." Other models did not refuse.
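
For anyone who wants the gaps rather than the raw numbers, here is a quick sketch of the arithmetic. The scores are copied from the list above; the `scores` dict and the `gain` helper are just illustrative names, not anything from the benchmark repo, and the differences are in whatever score points the benchmark reports.

```python
# Scores copied verbatim from the comment above.
scores = {
    "Claude Opus 4 Thinking 16K": 52.7,
    "Claude Opus 4 No Reasoning": 34.8,
    "Claude Sonnet 4 Thinking 64K": 39.6,
    "Claude Sonnet 4 Thinking 16K": 41.4,
    "Claude Sonnet 4 No Reasoning": 25.7,
    "Claude Sonnet 3.7 Thinking 16K": 33.6,
    "Claude Sonnet 3.7 No Reasoning": 19.2,
}


def gain(new: str, old: str) -> float:
    """Difference in score points between two entries (hypothetical helper)."""
    return round(scores[new] - scores[old], 1)


# Generation-over-generation change for Sonnet at the same setting.
sonnet_thinking_gain = gain("Claude Sonnet 4 Thinking 16K", "Claude Sonnet 3.7 Thinking 16K")
sonnet_plain_gain = gain("Claude Sonnet 4 No Reasoning", "Claude Sonnet 3.7 No Reasoning")

# Effect of turning on 16K thinking within the same model.
opus_thinking_gain = gain("Claude Opus 4 Thinking 16K", "Claude Opus 4 No Reasoning")
sonnet4_thinking_gain = gain("Claude Sonnet 4 Thinking 16K", "Claude Sonnet 4 No Reasoning")

# Effect of raising the Sonnet 4 thinking budget from 16K to 64K.
sonnet4_budget_change = gain("Claude Sonnet 4 Thinking 64K", "Claude Sonnet 4 Thinking 16K")

if __name__ == "__main__":
    print("Sonnet 4 vs 3.7, Thinking 16K:", sonnet_thinking_gain)
    print("Sonnet 4 vs 3.7, No Reasoning:", sonnet_plain_gain)
    print("Opus 4, Thinking 16K vs No Reasoning:", opus_thinking_gain)
    print("Sonnet 4, Thinking 16K vs No Reasoning:", sonnet4_thinking_gain)
    print("Sonnet 4, Thinking 64K vs 16K:", sonnet4_budget_change)
```

Note that the 64K budget comes out slightly below the 16K one for Sonnet 4 in these numbers; whether the single refused puzzle accounts for any of that is not something the scores alone can show.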


zone411

On my Thematic Generalization Benchmark (https://github.com/lechmazur/generalization, 810 questions), the Claude 4 models are the new champions.