Comment by lhl

25 days ago

I was a bit interested in doing a replication to see if a better harness could avoid some of the problems they ran into w/ context management, poor instruction following, etc., and it looks like yes, it's definitely possible.

Here's my repo: https://github.com/lhl/claudecycles-revisited

I used Codex w/ 5.2 xhigh and a relatively simple AGENTS.md - I have some session-analysis as well. The original replication was 47 minutes, then another 30 minutes of gap filling, and finally about 30 minutes of writing an extension to take the work a bit further, with Claude Code Opus 4.6 doing some documentation cleanup and verification.

As described in the readme of your repo (did you read it?), your agent found the Knuth paper one directory level above its working directory.

So you didn't produce a replication in 47 minutes; it just took around 30 minutes for your agent to find that you had the answer in a PDF in a nearby directory.

  • I wonder how common of a problem this will be in the future. The experiment will fail due to improper setup, the human will at best glance over the logs and declare victory, and everyone just believes.

omg this is so cool, because im writing my own harness and i need some cognitive benchmarks. i have a bunch of harness level infra around llm interactions that seems to help with reasoning, but i dont have a structured way to evaluate things

thx for sharing your test setup, i really appreciate the time you took. this will help me so much