← Back to context

Comment by simonw

1 day ago

Yeah, 50,000 lines sounds about right for 1m tokens.

If your codebase is larger than that there are a few tricks.

The first is to be selective about what you feed into the LLM: if you know the work you are doing is in a particular area of the codebase, just paste that bit in. The LLM can make reasonable guesses about things the code references that it can't see.

An increasingly effective trick is to arm a tool-using LLM with a tool like ripgrep (effectively the "interactively probe the project in a way that a human would" idea you suggested). Claude Code and OpenAI Codex both use this trick. The smarter models are really good at deciding what to search for and evaluating the results.

I've built tools that can run against Python code and extract just the class, function and method signatures and their docstrings - omitting the actual code. If you code is well designed and has reasonable documentation that could be enough for the LLM to understand it.

https://github.com/simonw/symbex is my CLI tool for that

https://simonwillison.net/2025/Apr/23/llm-fragment-symbex/ is a tool I released this morning that turns Symbex into a plugin for my LLM tool.

I use my https://llm.datasette.io/ tool a lot, especially with its new fragments feature: https://simonwillison.net/2025/Apr/7/long-context-llm/

This means I can feed in the exact code that the model needs in order to solve a problem. Here's a recent example:

  llm -m openai/o3 \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'

From https://simonwillison.net/2025/Apr/20/llm-fragments-github/ - I'm populating the context with the exact examples needed to solve the problem.