Comment by lhl
24 days ago
I am not a theoretical CS or math expert by any means, but I have been wrangling coding agents for a while. After reading the paper and the problems Stapper had dealing with Claude (context management, instruction following, etc.), I decided to see if I could replicate the result with a slightly better harness. The results were pretty interesting: https://github.com/lhl/claudecycles-revisited
- My original setup left traces of the PDF paper, and after GPT 5.3-Codex xhigh reached an impasse, it went looking for the paper and found it!
- I then did cleanroom (basically one-shot) passes for GPT 5.2 xhigh, GPT 5.3-Codex xhigh, and Claude Opus 4.6 ultrathink. 5.2 and 5.3 found alternate solutions for odd m >= 5; Opus 4.6 did not find any proofs but tried more approaches to solving.
Full comparison/analysis here: https://github.com/lhl/claudecycles-revisited/blob/main/COMP...
I've also included the session traces and analysis in the repo branches. The AGENTS.md was pretty simple, but the harness produced consistent process outcomes across all three models:
- All built verifiers first
- All maintained worklogs with exact commands
- All archived machine-readable artifacts
- All documented failed approaches
- All maintained restart-safe context capsules
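To make the worklog and context-capsule items concrete, here is a minimal Python sketch of what such helpers could look like. It is not taken from the repo or from the AGENTS.md; the function names `run_logged` and `save_capsule`, the JSONL worklog format, and the capsule fields are all made up for illustration.

```python
import json
import subprocess
import time
from pathlib import Path


def run_logged(cmd, worklog):
    """Run cmd (a list of arguments) and append the exact command and its
    result to a JSON Lines worklog. The format is hypothetical."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    entry = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "cmd": cmd,  # the exact argv, so the step can be re-run later
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
    # Append-only: earlier entries are never rewritten.
    with Path(worklog).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def save_capsule(path, goal, done, failed, next_step):
    """Overwrite a small JSON summary that a fresh session could read to
    resume work. The field names are assumptions, not the repo's schema."""
    capsule = {
        "goal": goal,
        "done": done,
        "failed": failed,  # failed approaches are kept, not discarded
        "next_step": next_step,
    }
    Path(path).write_text(json.dumps(capsule, indent=2), encoding="utf-8")
    return capsule
```

The idea is that the worklog only ever grows, while the capsule is a short, overwritten snapshot, so a restarted session can read one small file instead of replaying the whole history.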