Comment by lmeyerov

1 day ago

Something I would add is planning. A big "aha" for effective use of these tools is realizing they run on dynamic TODO lists. E.g., plan mode is basically bootstrapping how that TODO list gets seeded and how todos get grounded when they are reached, and user interactions are how you realign the todo list. The todo list is subtle but was a big shift in coding tools, and many seem surprised when we discuss it -- most focus on whether to use plan mode or not, but todo lists will still be active either way. I ran a fun experiment last month on how well Claude Code solves CTFs, and disabling the TodoList tool and planning costs 1-2 grade jumps: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...

Fwiw, I found it funny how the article stuffs "smarter context management" into a breezy TODO bullet point at the end for going production-grade. I've been noticing a lot of NIH/DIY types who believe they can do a good job of this and then, once forced to produce results/evals that don't suck in production, lose the rest of the year on that step. (And it's even worse when they decide to fine-tune too.)

I'm unsure of its accuracy/provenance/outdatedness, but this purportedly extracted system prompt for Claude Code provides a lot more detail about TODO iteration and how powerful it can be:

https://gist.github.com/wong2/e0f34aac66caf890a332f7b6f9e2ba...

I find it fascinating that while in theory one could just append these as reasoning tokens to the context and trust the attention mechanism to find the most recent TODO list and attend actively to it... in practice, creating explicit tools that essentially do single-key storage is far more effective and predictable. It makes me wonder how much other low-hanging fruit there is in tool creation for storing language that requires emphasis and structure.
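
To make the "single-key storage" idea concrete, here is a minimal sketch in Python. It is purely illustrative: the tool names, item schema, and in-memory store are my assumptions, not Claude Code's actual implementation. The whole TODO list lives under one key and every write replaces it wholesale, so the latest plan is always unambiguous and cheap to re-inject into the prompt.

```python
# Illustrative sketch only -- not Claude Code's implementation.
# One key holds the whole TODO list; every write replaces it wholesale.
import json

_TODO_STORE: dict[str, list[dict]] = {}

def todo_write(items: list[dict], key: str = "default") -> str:
    """Replace the entire TODO list. Each item: {"task": str, "status": str}."""
    _TODO_STORE[key] = items
    return f"Stored {len(items)} todos."

def todo_read(key: str = "default") -> str:
    """Return the current list as JSON, ready to be re-inserted into the prompt."""
    return json.dumps(_TODO_STORE.get(key, []), indent=2)

# Example: the agent updates its plan as it works.
todo_write([
    {"task": "Reproduce the failing test", "status": "done"},
    {"task": "Bisect the regression", "status": "in_progress"},
    {"task": "Write a fix + regression test", "status": "pending"},
])
print(todo_read())
```

The point is less the code than the contract: one key, whole-list replacement, so there is never ambiguity about which version of the plan is current.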

  • I find in coding + investigating there's a lot of mileage in being fancier with the todo list. E.g., we make sure timestamps, branches, outcomes, etc. are represented. It's impressive how far they get with so little!

    For coding, I actually fully take over the todo list in Codex + Claude: https://github.com/graphistry/pygraphistry/blob/master/ai/pr...

    In Louie.ai, for investigations, we're experimenting with exposing more control over it, so you can go with the grain rather than doing that kind of whole-cloth replacement.

    • Ooh, am I reading correctly that you're using the filesystem as the storage for a "living system prompt" that also includes a living TODO list? That's pretty cool!

      And on a separate note - it looks like you're making a system for dealing with graph data at scale? Are you using LLMs primarily to generate code for new visualizations, or also to reason directly about each graph in question? To tie it all together, I've long been curious whether tools can adequately translate things from "graph space" to "language space" in the context of agentic loops. There seems to be tremendous opportunity in representing e.g. physical spaces as graphs, and if LLMs can "imagine" what would happen if they interacted with them in structured ways, that might go a long way towards autonomous systems that can handle truly novel environments.

  • Isn't the system prompt of Claude public in the docs at https://platform.claude.com/docs/en/release-notes/system-pro... ?

    • The system prompt of Claude Code changes constantly. I use this site to see what has changed between versions: https://cchistory.mariozechner.at/

      It is a bit weird that Anthropic doesn't make it available more openly. Depending on your preferences, there is stuff in the default system prompt that you may want to change.

      I personally have a list of phrases that I patch out from the system prompt after each update by running sed on cc's main.js

  • From elsewhere in that prompt:

    > Only use emojis if the user explicitly requests it. Avoid adding emojis to files unless asked.

    When did they add this? Real shame because the abundance of emojis in a readme was a clear signal of slop.

I've had a LOT of success keeping a "working memory" file for CLI agents. I'm currently testing out Codex, and what I'll do is spend ~10 minutes hashing out the spec and splitting it into a list of changes, then tell the agent to save those changes to a file and keep that file updated as it works through them. The crucial part is to tell it to review the plan and modify it if needed after every change. This keeps the LLM doing what it does best (short-term goals with limited context) while removing the need to constantly prompt it. Essentially, I feel like it's an alternative to having subagents, for the same or a similar result.

  • I use a folder for each feature I add. The LLM is only allowed to output markdown files in the output subfolder (of course it doesn't always obey, but it still limits pollution in the main folder).

    The folder will contain a plan file and a changelog. The LLM is asked to continuously update the changelog.

    When I open a new chat, I attach the folder and say: onboard yourself on this feature then get back to me.

    This way, it has context on what has been done, the attempts it made (some of which perhaps failed), the current status, and the chronological order of the changes (with the recent ones usually considered more authoritative).
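
A minimal sketch of the working-memory / plan-folder pattern described in the two comments above. The file names (plan.md, changelog.md) and folder layout are my own illustrative choices, not anything Codex or Claude Code prescribes.

```python
# Illustrative sketch of a per-feature plan + changelog folder.
# Names and layout are assumptions; adapt to your own repo conventions.
from datetime import datetime, timezone
from pathlib import Path

def seed_feature_folder(root: Path, feature: str, plan_steps: list[str]) -> Path:
    """Create the folder the agent is allowed to write markdown into."""
    folder = root / "features" / feature / "output"
    folder.mkdir(parents=True, exist_ok=True)
    checklist = "\n".join(f"- [ ] {step}" for step in plan_steps)
    (folder / "plan.md").write_text(f"# Plan: {feature}\n\n{checklist}\n")
    (folder / "changelog.md").write_text(f"# Changelog: {feature}\n")
    return folder

def log_change(folder: Path, note: str) -> None:
    """Append a timestamped entry; the agent is told to do this after every change."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with (folder / "changelog.md").open("a") as f:
        f.write(f"- {stamp} {note}\n")

folder = seed_feature_folder(Path("."), "search-filters",
                             ["Spec the API", "Implement the backend", "Wire up the UI"])
log_change(folder, "Seeded initial plan")
```

The standing instruction to the agent is then roughly: review and update plan.md after every change, append to changelog.md, and when a new chat starts, read this folder first to onboard.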

Planning mode actually creates whole markdown files, then wipes the context that was required to create that plan before starting work. Then it holds the plan at the system prompt level to ensure it remains top of mind (and survives unaltered during context compaction).

It’s surprising how effective tools as simple as TodoWrite and TodoRead are at planning and at making sure an agent follows the plan.

This is supposed to be an emulator of Claude’s own TodoWrite and TodoRead; it does a full rewrite of todo.json for every task update. A nice use of composition with the edit tool: https://github.com/joehaddad2000/claude-todo-emulator

  • Complex planning and orchestration for multi-step use cases, or persistent todo lists, is achievable by spinning up your own tools that do something similar to this.

    By extending the Claude Todo emulator, it was possible to make the agent come up with multi-step hierarchical plans, follow them, and track updates on them for use cases like on-call troubleshooting runbooks.

    PS: the above open-source repo does not provide single-task update as a tool, but that is not hard to implement on your own (a sketch follows below).
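
A hypothetical sketch of what such a single-task-update tool could look like on top of a todo.json file. The {"id", "task", "status"} schema here is my assumption, not the emulator's actual format.

```python
# Hypothetical single-task update on top of a todo.json file.
# Schema ({"id", "task", "status"}) is assumed, not taken from the emulator.
import json
from pathlib import Path

TODO_FILE = Path("todo.json")

def update_task(task_id: str, status: str) -> dict:
    """Patch one entry in place instead of rewriting the whole list."""
    todos = json.loads(TODO_FILE.read_text()) if TODO_FILE.exists() else []
    for item in todos:
        if item["id"] == task_id:
            item["status"] = status
            break
    else:
        raise KeyError(f"No task with id {task_id!r}")
    TODO_FILE.write_text(json.dumps(todos, indent=2))
    return {"updated": task_id, "status": status}
```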

I’m a DIY (or, less generously and not altogether inaccurately, NIH) type who thinks he could do a good job of smarter context management. But, I have no particular reason to know better than anyone else. Tell me more. What have you seen? What kinds of approaches? Who’s working on it?

The TODO lists are also frequently re-inserted into the context HEAD to keep the LLM aware of past and next steps.

And in the event of context compression, the TODO serves as a compact representation of the session.
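
As a rough sketch of what that re-insertion can look like, assuming a generic OpenAI-style message list (the exact mechanics inside any particular agent are not something I can confirm):

```python
# Sketch: keep the current TODO list pinned near the head of the context.
# Message format here is a generic OpenAI-style list of {"role", "content"} dicts.
def with_todo_header(messages: list[dict], todo_markdown: str) -> list[dict]:
    header = {
        "role": "system",
        "content": "Current TODO list (authoritative; keep following it):\n" + todo_markdown,
    }
    # Keep the original system prompt first, then the TODO header, then the rest.
    system, rest = messages[:1], messages[1:]
    return system + [header] + rest
```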

Oh yes, I commonly add something like "Use a very granular todo list for this task" at the end of my prompts. And sometimes I will say something like "as your last todo, go over everything you just did again and use a linter or other tools to verify your work is high quality"

  • Right now I start by chatting with a separate LLM about the issue: the best structure for maintainability, the best libraries for the job, edge and corner cases and how to handle them, and then have it spit out a prompt and a checklist. If there's a UI, I'll draw something in Paint and refine it before having the LLM describe it in detail along with the primary workflow, etc., and tell it to format all of that for an agent to use. That will usually get me a functional system on the first try, which can then be iterated on.

    That's for complicated stuff. For throwaway stuff I don't need to maintain past 30 days, like a script, I'll just roll the dice and let it rip.

    • Yeah, this is a good idea. I will have a Claude chat session and a Claude Code session open side by side too.

      Like a manual subagents approach. I try not to pollute the Claude Code session context with meanderings too much. I do that in the chat and bring the condensed ideas over.

  • If you have pre-commit hooks, it should do this last bit automatically and use your project settings.

    • Yes, I do. But it does not always use them when I change contexts. I just get in the habit of saying it. Belt and suspenders approach.

I run evals, and the Todo tool doesn't help most of the time. Models on high thinking will usually maintain the todo/state in their thinking tokens anyway. Where Todo does help is in cases like getting Anthropic models to run more parallel tool calls: if there is a todo-list call, some of the actions after it are more efficient.

What you need to do is match the distribution of how the models were RL-ed. So you are right to say that "do X in 200 lines" is a very small part of the job to be done.

  • Curious what kinds of evals you focus on?

    We're finding investigation to be same-but-different from coding. Probably the closest field to ours with a bigger evals community is AI SRE tasks.

    Agreed wrt all these things being contextual. The LLM needs to decide whether to trigger tools like self-planning and todo lists and, as the talk gives examples of, which kinds of strategies to use with them.

Mind if I ask what models you’re using for CTF? I got out of the game about ten years ago and have recently been thinking about dipping my toes back in.

  • Yep -- one fun experiment early in the video shows that going from Sonnet 4.5 to Opus 4.5 gave a 20% lift.

    We do a bit of model-per-task: most calls send targeted & limited context fetches to faster higher-tier models (frontier, but no heavy reasoning tokens), and occasional larger data dumps (logs/dataframes) go to faster-and-cheaper models. Commercially, we're steering folks right now more toward OpenAI / Azure OpenAI models, but that's not at all inherent. OpenAI, Claude, and Gemini can all be made to perform well here using what the talk goes over.

    Some of the discussion earlyish in the talk, and in the Q&A after, is on making OSS models production-grade for these kinds of investigation tasks. I find them fun to learn on and encourage homelab experiments, and for copilots you can get mileage. For heavier production efforts, I typically do not recommend them for most teams at this time for quality, speed, practicality, and budget reasons, if they have the option to go with frontier models. However, some bigger shops are doing it, and I'd be happy to chat about how we're approaching quality/speed/cost there (and we're looking for partners on making this easier for everyone!)

    • Nice! Thank you!

      I just did an experiment yesterday with Opus 4.5 just operating in agent mode in vscode copilot. Handed it a live STS session for AWS to see if it could help us troubleshoot an issue. It was pretty remarkable seeing it chop down the problem space and arrive at an accurate answer in just a few mins.

      I'll definitely check out the video later. Thanks!

And at the end of the year you get “How to Code Claude Code in 200 Million Lines of Code” :)