Comment by lmeyerov

1 day ago

Something I would add is planning. A big "aha" for effective use of these tools is realizing they run on dynamic TODO lists. E.g., plan mode is basically bootstrapping how that TODO list gets seeded and how todos get grounded when they are reached, and user interactions are how you realign the todo list. The todo list is subtle but was a big shift in coding tools, and many seem surprised when we discuss it -- most focus on whether to use plan mode or not, but todo lists will still be active either way. I ran a fun experiment last month on how well Claude Code solves CTFs, and disabling the TodoList tool and planning costs 1-2 grade jumps: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...

Fwiw, I found it funny how the article stuffs "smarter context management" into a breezy TODO bullet point at the end for going production-grade. I've been noticing a lot of NIH/DIY types who believe they can do a good job of this and then, once forced to produce results/evals that don't suck in production, lose the rest of the year on that step. (And it's even worse when they decide to fine-tune too.)

I'm unsure of its accuracy/provenance/outdatedness, but this purportedly extracted system prompt for Claude Code provides a lot more detail about TODO iteration and how powerful it can be:

https://gist.github.com/wong2/e0f34aac66caf890a332f7b6f9e2ba...

I find it fascinating that while in theory one could just append these as reasoning tokens to the context and trust the attention mechanism to find the most recent TODO list and attend actively to it... in practice, creating explicit tools that essentially do single-key storage is far more effective and predictable. It makes me wonder how much other low-hanging fruit there is in tool creation for storing language that requires emphasis and structure.
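
To make the "single-key storage" idea concrete, here is a minimal sketch in Python. It is purely illustrative: the tool names, item schema, and in-memory store are my assumptions, not Claude Code's actual implementation. The whole TODO list lives under one key and every write replaces it wholesale, so the latest plan is always unambiguous and cheap to re-inject into the prompt.

```python
# Illustrative sketch only -- not Claude Code's implementation.
# One key holds the whole TODO list; every write replaces it wholesale.
import json

_TODO_STORE: dict[str, list[dict]] = {}

def todo_write(items: list[dict], key: str = "default") -> str:
    """Replace the entire TODO list. Each item: {"task": str, "status": str}."""
    _TODO_STORE[key] = items
    return f"Stored {len(items)} todos."

def todo_read(key: str = "default") -> str:
    """Return the current list as JSON, ready to be re-inserted into the prompt."""
    return json.dumps(_TODO_STORE.get(key, []), indent=2)

# Example: the agent updates its plan as it works.
todo_write([
    {"task": "Reproduce the failing test", "status": "done"},
    {"task": "Bisect the regression", "status": "in_progress"},
    {"task": "Write a fix + regression test", "status": "pending"},
])
print(todo_read())
```

The point is less the code than the contract: one key, whole-list replacement, so there is never ambiguity about which version of the plan is current.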

  • I find in coding + investigating there's a lot of mileage in being fancier with the todo list. E.g., we make sure timestamps, branches, outcomes, etc. are represented. It's impressive how far they get with so little!

    For coding, I actually fully take over the todo list in Codex + Claude: https://github.com/graphistry/pygraphistry/blob/master/ai/pr...

    In Louie.ai, for investigations, we're experimenting with exposing more control over it, so you can go with the grain rather than doing that kind of whole-cloth replacement.

    • Ooh, am I reading correctly that you're using the filesystem as the storage for a "living system prompt" that also includes a living TODO list? That's pretty cool!

      And on a separate note - it looks like you're making a system for dealing with graph data at scale? Are you using LLMs primarily to generate code for new visualizations, or also to reason directly about each graph in question? To tie it all together, I've long been curious whether tools can adequately translate things from "graph space" to "language space" in the context of agentic loops. There seems to be tremendous opportunity in representing e.g. physical spaces as graphs, and if LLMs can "imagine" what would happen if they interacted with them in structured ways, that might go a long way towards autonomous systems that can handle truly novel environments.

  • Isn't the system prompt of Claude public in the docs at https://platform.claude.com/docs/en/release-notes/system-pro... ?

    • The system prompt of Claude Code changes constantly. I use this site to see what has changed between versions: https://cchistory.mariozechner.at/

      It is a bit weird that Anthropic doesn't make it available more openly. Depending on your preferences, there is stuff in the default system prompt that you may want to change.

      I personally have a list of phrases that I patch out from the system prompt after each update by running sed on cc's main.js

  • From elsewhere in that prompt:

    > Only use emojis if the user explicitly requests it. Avoid adding emojis to files unless asked.

    When did they add this? Real shame because the abundance of emojis in a readme was a clear signal of slop.

I've had a LOT of success keeping a "working memory" file for CLI agents. I'm currently testing out Codex, and what I'll do is spend ~10 minutes hashing out the spec and splitting it into a list of changes, then tell the agent to save those changes to a file and keep that file updated as it works through them. The crucial part is to tell it to review the plan and modify it if needed after every change. This keeps the LLM doing what it does best (short-term goals with limited context) while removing the need to constantly prompt it. Essentially, I feel like it's an alternative to having subagents, for the same or a similar result.

  • I use a folder for each feature I add. The LLM is only allowed to output markdown files in the output subfolder (of course it doesn't always obey, but it still limits pollution in the main folder).

    The folder will contain a plan file and a changelog. The LLM is asked to continuously update the changelog.

    When I open a new chat, I attach the folder and say: onboard yourself on this feature then get back to me.

    This way, it has context on what has been done, the attempts it made (some of which perhaps failed), the current status, and the chronological order of the changes (with the recent ones usually considered more authoritative).
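
A minimal sketch of the working-memory / plan-folder pattern described in the two comments above. The file names (plan.md, changelog.md) and folder layout are my own illustrative choices, not anything Codex or Claude Code prescribes.

```python
# Illustrative sketch of a per-feature plan + changelog folder.
# Names and layout are assumptions; adapt to your own repo conventions.
from datetime import datetime, timezone
from pathlib import Path

def seed_feature_folder(root: Path, feature: str, plan_steps: list[str]) -> Path:
    """Create the folder the agent is allowed to write markdown into."""
    folder = root / "features" / feature / "output"
    folder.mkdir(parents=True, exist_ok=True)
    checklist = "\n".join(f"- [ ] {step}" for step in plan_steps)
    (folder / "plan.md").write_text(f"# Plan: {feature}\n\n{checklist}\n")
    (folder / "changelog.md").write_text(f"# Changelog: {feature}\n")
    return folder

def log_change(folder: Path, note: str) -> None:
    """Append a timestamped entry; the agent is told to do this after every change."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with (folder / "changelog.md").open("a") as f:
        f.write(f"- {stamp} {note}\n")

folder = seed_feature_folder(Path("."), "search-filters",
                             ["Spec the API", "Implement the backend", "Wire up the UI"])
log_change(folder, "Seeded initial plan")
```

The standing instruction to the agent is then roughly: review and update plan.md after every change, append to changelog.md, and when a new chat starts, read this folder first to onboard.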

Planning mode actually creates whole markdown files, then wipes the context that was required to create that plan before starting work. Then it holds the plan at the system prompt level to ensure it remains top of mind (and survives unaltered during context compaction).

It’s surprising how effective tools as simple as TodoWrite and TodoRead are at planning and at making sure an agent follows the plan.

This is supposed to be an emulator of Claude’s own TodoWrite and TodoRead; it does a full rewrite of todo.json for every task update. A nice use of composition with the edit tool: https://github.com/joehaddad2000/claude-todo-emulator

  • Complex planning and orchestration for multi-step use cases, or persistent todo lists, is achievable by spinning up your own tools that do something similar to this.

    By extending the Claude Todo emulator, it was possible to make the agent come up with multi-step hierarchical plans, follow them, and track updates on them for use cases like on-call troubleshooting runbooks.

    PS: the above open-source repo does not provide single-task update as a tool, but that is not hard to implement on your own (a sketch follows below).
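
A hypothetical sketch of what such a single-task-update tool could look like on top of a todo.json file. The {"id", "task", "status"} schema here is my assumption, not the emulator's actual format.

```python
# Hypothetical single-task update on top of a todo.json file.
# Schema ({"id", "task", "status"}) is assumed, not taken from the emulator.
import json
from pathlib import Path

TODO_FILE = Path("todo.json")

def update_task(task_id: str, status: str) -> dict:
    """Patch one entry in place instead of rewriting the whole list."""
    todos = json.loads(TODO_FILE.read_text()) if TODO_FILE.exists() else []
    for item in todos:
        if item["id"] == task_id:
            item["status"] = status
            break
    else:
        raise KeyError(f"No task with id {task_id!r}")
    TODO_FILE.write_text(json.dumps(todos, indent=2))
    return {"updated": task_id, "status": status}
```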

I’m a DIY (or, less generously and not altogether inaccurately, NIH) type who thinks he could do a good job of smarter context management. But, I have no particular reason to know better than anyone else. Tell me more. What have you seen? What kinds of approaches? Who’s working on it?

The TODO lists are also frequently re-inserted into the context HEAD to keep the LLM aware of past and next steps.

And in the event of context compression, the TODO serves as a compact representation of the session.
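
As a rough sketch of what that re-insertion can look like, assuming a generic OpenAI-style message list (the exact mechanics inside any particular agent are not something I can confirm):

```python
# Sketch: keep the current TODO list pinned near the head of the context.
# Message format here is a generic OpenAI-style list of {"role", "content"} dicts.
def with_todo_header(messages: list[dict], todo_markdown: str) -> list[dict]:
    header = {
        "role": "system",
        "content": "Current TODO list (authoritative; keep following it):\n" + todo_markdown,
    }
    # Keep the original system prompt first, then the TODO header, then the rest.
    system, rest = messages[:1], messages[1:]
    return system + [header] + rest
```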

Oh yes, I commonly add something like "Use a very granular todo list for this task" at the end of my prompts. And sometimes I will say something like "as your last todo, go over everything you just did again and use a linter or other tools to verify your work is high quality"

  • Right now I start by chatting with a separate LLM about the issue: the best structure for maintainability, the best libraries for the job, edge and corner cases and how to handle them, and then have it spit out a prompt and a checklist. If there's a UI, I'll draw something in Paint and refine it before having the LLM describe it in detail along with the primary workflow, etc., and tell it to format all of that for an agent to use. That will usually get me a functional system on the first try, which can then be iterated on.

    That's for complicated stuff. For throwaway stuff I don't need to maintain past 30 days, like a script, I'll just roll the dice and let it rip.

    • Yeah, this is a good idea. I will have a Claude chat session and a Claude Code session open side by side too.

      Like a manual subagents approach. I try not to pollute the Claude Code session context with meanderings too much. I do that in the chat and bring the condensed ideas over.

  • If you have pre-commit hooks, it should do this last bit automatically and use your project settings.

    • Yes, I do. But it does not always use them when I change contexts. I just get in the habit of saying it. Belt and suspenders approach.

I run evals, and the Todo tool doesn't help most of the time. Models on high thinking will usually maintain the todo/state in their thinking tokens anyway. Where Todo does help is in cases like getting Anthropic models to run more parallel tool calls: if there is a todo-list call, some of the actions after it are more efficient.

What you need to do is match the distribution of how the models were RL-ed. So you are right to say that "do X in 200 lines" is a very small part of the job to be done.

  • Curious what kinds of evals you focus on?

    We're finding investigation to be same-but-different from coding. Probably the closest field to ours with a bigger evals community is AI SRE tasks.

    Agreed wrt all these things being contextual. The LLM needs to decide whether to trigger tools like self-planning and todo lists and, as the talk gives examples of, which kinds of strategies to use with them.

Mind if I ask what models you’re using for CTF? I got out of the game about ten years ago and have recently been thinking about dipping my toes back in.

  • Yep -- one fun experiment early in the video shows that going from Sonnet 4.5 to Opus 4.5 gave a 20% lift.

    We do a bit of model-per-task: most calls send targeted & limited context fetches to faster higher-tier models (frontier, but no heavy reasoning tokens), and occasional larger data dumps (logs/dataframes) go to faster-and-cheaper models. Commercially, we're steering folks right now more toward OpenAI / Azure OpenAI models, but that's not at all inherent. OpenAI, Claude, and Gemini can all be made to perform well here using what the talk goes over.

    Some of the discussion earlyish in the talk, and in the Q&A after, is on making OSS models production-grade for these kinds of investigation tasks. I find them fun to learn on and encourage homelab experiments, and for copilots you can get mileage. For heavier production efforts, I typically do not recommend them for most teams at this time for quality, speed, practicality, and budget reasons, if they have the option to go with frontier models. However, some bigger shops are doing it, and I'd be happy to chat about how we're approaching quality/speed/cost there (and we're looking for partners on making this easier for everyone!)

    • Nice! Thank you!

      I just did an experiment yesterday with Opus 4.5 just operating in agent mode in vscode copilot. Handed it a live STS session for AWS to see if it could help us troubleshoot an issue. It was pretty remarkable seeing it chop down the problem space and arrive at an accurate answer in just a few mins.

      I'll definitely check out the video later. Thanks!

And at the end of the year you get “How to Code Claude Code in 200 Million Lines of Code” :)