Show HN: Transform your codebase into a single Markdown doc for feeding into AI

8 days ago (tesserato.web.app)

CodeWeaver is a command-line tool designed to weave your codebase into a single, easy-to-navigate Markdown document. It recursively scans a directory, generating a structured representation of your project's file hierarchy and embedding the content of each file within code blocks. This tool simplifies codebase sharing, documentation, and integration with AI/ML code analysis tools by providing a consolidated and readable Markdown output.

CodeWeavers is a software company that focuses on Wine development and sells a proprietary version of Wine called CrossOver for running Windows applications on macOS, ChromeOS and Linux.

https://en.wikipedia.org/wiki/CodeWeavers

Trademark is active. It's an ®, not just a ™: registered, not just trademarked. To keep it, they have to demonstrate that they defend it.

https://www.trademarkia.com/codeweavers-76546826

While this project drops the final "s", you don't get to launch an OS called "Window". The test is a fuzzy match based on likelihood of confusion.

  • Yeah, I was thinking "what do the Wine guys have to do with this?"

    This project is definitely going to get C&D'd.

    • Do you think they would actually litigate? They seem like different products serving entirely different markets, so I am not sure the trademark-infringement claim is very defensible. And how would they prove damages?

I use the following for feeding a codebase into an AI:

   find . -type f -print -exec cat {} \; -exec echo \;

This will print, for each file (including those in subfolders), the filename followed by the file's content.

Then `| pbcopy` to copy to clipboard and paste it into ChatGPT or similar.
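
For reference, here's a minimal Go sketch of the same idea with Markdown structure, roughly what tools like CodeWeaver automate (the output format below is my own, not CodeWeaver's actual format):

    // dump.go: walk the current directory and print each file as a
    // Markdown section with its content in a fenced code block.
    package main

    import (
        "fmt"
        "io/fs"
        "os"
        "path/filepath"
        "strings"
    )

    func main() {
        err := filepath.WalkDir(".", func(path string, d fs.DirEntry, err error) error {
            if err != nil {
                return err
            }
            // Skip hidden directories such as .git.
            if d.IsDir() {
                if path != "." && strings.HasPrefix(d.Name(), ".") {
                    return filepath.SkipDir
                }
                return nil
            }
            data, err := os.ReadFile(path)
            if err != nil {
                return err
            }
            fmt.Printf("## %s\n\n```\n%s\n```\n\n", path, data)
            return nil
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }

Run it with `go run dump.go | pbcopy`, same as the find pipeline above.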

Tip: If you ever need to do this on a public GitHub repository, you can use "gitingest".

This will open a website that creates a copy of all the file contents of the repo (code, docs, ...). It's a great tool to use when working with new/obscure code in LLMs, in my opinion.

The UX is just so easy and great: change the URL from <https://github.com/user_name/repo_name> to <https://gitingest.com/user_name/repo_name>

//edit: fixed URLs

  • I copied the UX for my https://gitpodcast.com (creates a podcast from a GitHub repo; same idea, replace `hub` with `podcast`)

    • I am very impressed by gitpodcast. I just listened to one podcast, and first of all I am pleased with the idea; the voices are also pleasant to listen to. Thanks for sharing!

Unfortunate naming, given that CodeWeavers is already a company making a Windows "emulator" for Linux and macOS. [1]

[1] https://www.codeweavers.com/

  • CodeWeavers are actually making Wine, not just some "emulator". They then distribute this, along with some QOL tools, as a commercial product called CrossOver.

  • All names are taken. There's no need to point this out every time.

    • Huewoblfan is not taken! Noiewoidc is free. XIONqlic – totally available, can mean a range of things! Ciohupoij – a bit of Asian flavour but still a valid free name.

How does this compare to / differ from https://github.com/yamadashy/repomix ?

  • Some advantages of CodeWeaver: it is compiled, so it might be faster; you can grab a compatible executable from the releases section instead of using `go install`, so there are no dependencies; and you can manually specify what to exclude via a comma-separated list of regular expressions, so it might be more flexible. I've never used Repomix, so those assumptions might not hold. On the other hand, Repomix seems to be far more complete: a full-fledged solution for converting source code to monolithic representations. I wrote CodeWeaver because I only needed something that worked and that, occasionally, I could trust to keep sensitive data away from sketchy LLMs (and I wasn't aware of other solutions).
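
    For illustration, here's a minimal Go sketch of the comma-separated-regex exclusion idea (the flag name and default patterns are hypothetical, not CodeWeaver's actual interface):

        // exclude.go: skip paths matching any of a comma-separated
        // list of regular expressions.
        package main

        import (
            "flag"
            "fmt"
            "regexp"
            "strings"
        )

        func main() {
            exclude := flag.String("exclude", `\.git/,node_modules/,\.png$`,
                "comma-separated regexes; matching paths are skipped")
            flag.Parse()

            var patterns []*regexp.Regexp
            for _, p := range strings.Split(*exclude, ",") {
                patterns = append(patterns, regexp.MustCompile(p))
            }

            skip := func(path string) bool {
                for _, re := range patterns {
                    if re.MatchString(path) {
                        return true
                    }
                }
                return false
            }

            fmt.Println(skip("assets/logo.png")) // true
            fmt.Println(skip("main.go"))         // false
        }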

I really want a tool like this that can extract a function and its dependency graph (to a certain depth maybe, and/or exclude node_modules).

I wrote this library [1] and hope to add the fine-grained "reference resolution" utility to it at some point, which could make implementing such a tool a lot simpler.

[1]: https://github.com/aleclarson/ts-module-graph
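
The traversal itself is the easy part once references are resolved. Here's a minimal Go sketch of the depth-limited walk, where the adjacency map is a hypothetical stand-in for whatever reference resolution produces:

    // closure.go: collect every function reachable from a root,
    // up to a maximum depth, via breadth-first traversal.
    package main

    import "fmt"

    // deps maps a function to the functions it references. In a real
    // tool this would come from reference resolution, not be hardcoded.
    var deps = map[string][]string{
        "handler": {"validate", "save"},
        "save":    {"serialize", "writeRow"},
    }

    func closure(root string, maxDepth int) []string {
        type item struct {
            name  string
            depth int
        }
        seen := map[string]bool{root: true}
        queue := []item{{root, 0}}
        var out []string
        for len(queue) > 0 {
            cur := queue[0]
            queue = queue[1:]
            out = append(out, cur.name)
            if cur.depth == maxDepth {
                continue
            }
            for _, next := range deps[cur.name] {
                if !seen[next] {
                    seen[next] = true
                    queue = append(queue, item{next, cur.depth + 1})
                }
            }
        }
        return out
    }

    func main() {
        fmt.Println(closure("handler", 1)) // [handler validate save]
    }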

I use aider's /copy-context command for that

https://aider.chat/docs/usage/copypaste.html

and with /paste you can apply the changes.

  • Thanks for letting folks know about aider's /copy-context command.

    To add some more detail, aider has a mode/UX that is optimized for "copy and paste" coding with LLM web chats. The "big brain" LLM in the web chat does the hard work, and a cheap/local LLM works with aider to apply edits to your local files.

    There's a little demo video in the link above that should give you the gist.

I’ve made a CLI tool that does something similar, called Copcon:

https://github.com/kasperjunge/copcon

Point it at a code project directory to get a file tree and file contents, optionally with a git diff, copied to the clipboard, ready for pasting into ChatGPT.

It is very true that this only works for small projects, as you will bloat the LLM’s context with large codebases.

My solution to this is two files you can use to steer the tool’s behavior:

- .copconignore: For ignoring specific files and directories.

- .copcontarget: For targeting specific files and directories (applied before .copconignore).

These two files provide great control over what to include and exclude in the copied context.
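
For example, a hypothetical pair of files (assuming gitignore-style patterns; check the Copcon README for the exact syntax):

    # .copcontarget: only consider the backend service
    src/backend/

    # .copconignore: then drop noise within it
    *.lock
    tests/fixtures/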

A new tool like this comes out every week, and that's great! But I think it's fair to ask how it compares to popular ones like Repomix. Anyone keeping an eye on this space will want to know how it differs from what's already out there and being used.

  • I actually wrote this a couple of months ago, so perhaps nothing similar existed back then (I remember doing some research at the time, mostly focused on VS Code plugins). Nevertheless, the idea was also to test how Golang could facilitate distributing such micro tools throughout the internal team, so I probably would have made it anyway. It is nice to know that similar tools exist; I'll take a look at them.

  find . -type f -name '*.py' -exec sh -c 'echo "# $1"; cat "$1"; echo ""' _ {} \; | pbcopy

Somewhat related. I built an Elm app all in one file as an experiment and to see if I like it. It's a little over 7k lines and I'm occasionally adding more to it.

It's actually pretty straightforward if you're in a language with lexical scoping, and it simplifies some things: no include or cyclic-dependency issues, no modules, no hunting through files, etc.

I feel like this setup could integrate really well w/ AI models.

I've found that the only real limitation, at least in my experiment, was the lack of decent editor support. I use vim, so this wasn't really much of an issue for me, with many great ways to navigate a file and a combination of vertical and horizontal splits on a large screen, but when I opened it up in other "modern" editors the ergonomics fell apart quite a bit.

I think the biggest downside was that re-using variable names between large scopes occasionally made it hard to find the reference I wanted (e.g. i, x, key, val), but again, better editor support allowing you to limit your search to the current scope would help. It's also easily mitigated with more verbose throwaway variable naming.

  • I write Elm and use Emacs primarily, and sometimes Neovim. Are you using LSP in vim? You’re doing it right by staying in one file until it hurts; that’s the recommendation for Elm. But I can’t recall whether I’ve had issues using go-to-def or other LSP functions like you’re describing.

    • No LSP. It honestly doesn’t speed me up any. I already have the standard library memorized, plus some of the common community lib methods (List.Extra) and my typing speed is faster than I can think anyways.

      I’m thinking the same approach would also work well in F#, Haskell, OCaml.

  • > no hunting through files, etc.

    It’s easy to switch to files by name with a few keystrokes. File names group the things I’m looking for.

    I would much rather do that than try to search through a 7,000 line file for what I need.

    > I feel like this setup could integrate really well w/ AI models.

    Massive files or too many files break AI models. Grouping functionality into smaller files and including only relevant files is key. The file and folder names can be hints about where to find the right files to include.

    • > I would much rather do that than try to search through a 7,000 line file for what I need.

      I mean I'm not arguing for it as a best practice. I did it as an experiment (as I stated), and discovered it's actually really easy and snappy for me to navigate in Vim. Mileage may vary with other editors. Have you tried it?

      > Massive files or too many files break AI models

      Context windows are growing faster than I code! With the latest Gemini, at least, it's much larger, at 1-2 million tokens. I'm sure we'll hit a ceiling eventually, but I also think we may find some context-caching / RAG-type optimizations.

  • The big problem with that is you’ll eventually blow your context window by feeding the model stuff that it mostly doesn’t need in order to complete its task.

    • I can’t think of anything I’d want to add to the context for Elm, at least, assuming the standard libraries are already in the model (or can be added via RAG). Gemini is at 2M tokens now, and I expect this will grow at least until it’s no longer meaningful.

This is like a rediscovery of an org-mode capability that has existed for decades, except it doesn't do as much.

  • Is it? I use org-babel regularly but wasn't aware of it - what's the function called? As great as org-mode / org-babel is, the user base is too small for it not to be overlooked.

    • Well, in general I've put entire projects into org docs and run the code blocks, essentially using it like a Jupyter notebook (although honestly it wasn't always as smooth as I'd like). And I haven't done this myself, but there's a neat literate programming talk from the last EmacsConf[0] in which the presenter showed some custom capabilities that improved the experience even more for him.

      [0] https://emacsconf.org/2024/talks/literate/

Following the /llms.txt standard proposition, I created an MkDocs plugin that generates an /llms.txt file at the root of your site. So, same thing, but it generates the Markdown document from your docs (possibly containing an API reference) instead of your code.

Such functionality would be useful for developing some scripts and then converting them to a Quarto document [1].

[1] https://quarto.org/

  • I've never used Quarto, but I might give it a go someday. I currently have a convoluted workflow for generating math-heavy documents that involves generating equations with SymPy in a notebook, accumulating them in a string, and ultimately dumping the string into a Markdown file. I would love to simplify this sooner rather than later. I'm also keeping an eye on https://typst.app/ and hoping for a sane alternative to LaTeX to emerge.

This could be a lot better. The example linked in the GitHub README is a Markdown file full of binary garbage, because the tool also tried to convert gzip files to Markdown.

Pretty big flag that this isn't ready for primetime.

My codebase sitting at 4M lines: hold my spaghetti.

  • You can ask Cursor to use information from a specific folder (i.e. your 4M lines) and it will summarize it and use that.

    Not a replacement for the full 4M lines, but it might work for some tasks/prompts.

This kind of context is really useful for LLMs, but in any significant project, including all code in this manner will easily exceed context limitations. I've been wanting to do something like this for my PHP projects, but instead of dumping the entire files, it would just create a map of each file's method signatures, variables, etc. That should give good enough information about what each file is used for and can do, while being small enough to be ingested by an AI.
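
As a sketch of the idea (shown in Go rather than PHP, since Go's standard library ships a parser): parse each file and print its function declarations with the bodies stripped, leaving only the signatures.

    // sigs.go: print only the function signatures from a Go source file.
    package main

    import (
        "fmt"
        "go/ast"
        "go/parser"
        "go/printer"
        "go/token"
        "os"
    )

    func main() {
        fset := token.NewFileSet()
        file, err := parser.ParseFile(fset, os.Args[1], nil, parser.SkipObjectResolution)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        for _, decl := range file.Decls {
            if fn, ok := decl.(*ast.FuncDecl); ok {
                fn.Body = nil // drop the body, keep the signature
                printer.Fprint(os.Stdout, fset, fn)
                fmt.Println()
            }
        }
    }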

  • > including all code in this manner will easily exceed context limitations

    The context window for Gemini 2.0 Flash can handle roughly 50,000 lines of code, and 2.0 Pro can handle twice that.

    • That goes faster than you think. Also, the model's attention to and memory of facts in the context diminish as the context grows, which might hurt when you just want to dump everything in at once.

For extra points, compile your docs into one file and feed it that as well.

(unless the reason you're giving AI the code is that you don't have any docs for either humans or machines)

Anybody have experience using something like this with a big codebase and Gemini's 2M-token context window? I tried a while ago (before 2.0 Flash) to solve some refactoring tasks, and even after spending some time on prompt wrangling I didn't manage to get good results out of it.

I don't know what kind of agent architecture Cursor uses internally, but it seems much better at finding where changes need to be made.

  • In my experience with feeding large codebases to Gemini, simple tasks work OK (enumerate where such-and-such happens, find where a certain function is called, list TODOs throughout the code, etc.), but tasks that require a bit more logic are trickier. Nevertheless, I've had some success with moderately complex refactoring tasks in Python codebases.

This thread has convinced me that Aider/Cursor need to do more marketing.

  • Same for Windsurf; I’ve been using it to generate documentation for codebases. It will generate Markdown with Mermaid diagrams to explain whatever you want to know: from the component architecture of an entire application, to the sequence diagram for a specific button, to data and ER diagrams.

    But the approach of fitting your entire codebase into one document so you can include it in your prompt context seems like a dead end; instead, the LLM can use an agent to do targeted searches through your code.

  • Cursor is all the rage. Nobody talks about Aider, sadly.

    • I partially disagree. Maybe it depends on what circles you run in, but at least here on HN I’ve seen Aider mentioned more times than I can count. Is Cursor more popular? Yeah… but the people here are talking about Aider. That’s how I learned about it.

I made a similar tool in Golang, https://github.com/foresturquhart/grimoire. It tries to be a bit cleverer by prioritising files that have had many commits, respecting .gitignore files, and excluding useless content like binaries or vector images.

  • I can think of no use case where binaries are desired in such a representation, so I might bake binary exclusion into CodeWeaver as well (see the sketch below). SVGs, on the other hand, might sometimes be wanted, in web design contexts. I'll take a look at your implementation and see what I can learn.
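
    For what it's worth, a common binary-detection heuristic (the one Git uses) is to look for a NUL byte in the first 8,000 bytes. A minimal Go sketch:

        // looksBinary reports whether a file is probably binary:
        // any NUL byte in its first 8000 bytes.
        package main

        import (
            "bytes"
            "fmt"
            "os"
        )

        func looksBinary(path string) (bool, error) {
            f, err := os.Open(path)
            if err != nil {
                return false, err
            }
            defer f.Close()
            buf := make([]byte, 8000)
            n, _ := f.Read(buf)
            return bytes.IndexByte(buf[:n], 0) >= 0, nil
        }

        func main() {
            isBin, err := looksBinary(os.Args[1])
            if err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(1)
            }
            fmt.Println(isBin)
        }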

Wouldn't it be wonderful to have a tool where you interact with the AI through the codebase via an IDE / vim / emacs tree? Say, you open your codebase and start with prompts, and the AI+tool navigates to a function or wherever it needs to go and modifies stuff while chatting with you about it? Or you jump somewhere and highlight where you are to scope down its focus (while it still retains all of the code in history/memory). Sort of like pair programming. It sounds so obvious that I'm almost sure I've missed it already existing somewhere. I think I tried Google's thing (forgot the name) but it sucked / wasn't that.

  • I think you’re describing Aider.chat. There are two Emacs packages for it, one official and a very recent fork. Aider is a CLI, so it works great with vim as well.

    In Emacs I’ve had a good experience with gptel as well, but I prefer aider for the coding workflow.

    • Yep, I've particularly been enjoying the recent "watchfiles" feature where a comment can be added to the source file, and ending it with "ai?" or "ai!" triggers use of said comment as a prompt to ask about or change that section upon save.

  • Apologies if I'm missing something, but aren't you describing Cursor/Copilot/Windsurf?

    • You're not; it looks like that's kind of it. But would the thing have the context of the whole project when I'm in a file/class/function? With Copilot, in my case, it has so far mostly been like a fancy autocomplete that only holds the immediate vicinity in its memory, whereas it would be vastly more useful if it had the context of the whole project / all files.

  • This doesn't sound good to me: you end up with a large codebase that no human has actually laid eyes on. When you get a bug weird enough that you can't reason the LLM through it, then what? What if a bug is caused by interactions between two systems, and you don't own one of them? What if there's an issue due to convoluted business-process failures that just ends in a bug report like "my data is missing!"? I honestly think in the latter case the LLM will just fix a 'bug' and miss the forest for the trees.

    I prefer the idea from the other comment reply, where you use AI as a tool to explore a codebase and assist you, not something you instruct to do the work. It can accelerate building that experience and intuition at a level we've never been able to reach before.

    • An LLM itself is a large codebase that no human has laid eyes on; instead, you validate it through testing.

      Regarding testing, I’ve had an interaction with Windsurf where I told it there was a bug in the application it generated. It replied “I’ve added some log statements, can you run it and tell me what you see, then I’ll know what to fix”… The LLM was instructing me…

    • Nothing like that at all. For example, I have a few codebases that are kind of large (for a certain definition of large) where I know the code, since I either wrote it or participated heavily in it. Feeding in snippets at a time loses a ton of context; the model would offer better solutions if it had, well... the whole context.

  • I tried various solutions but I still haven’t found a chat tool that allows me to navigate a large monorepo. I’d like to be able to say "open the file where there is the function to do <xyz>", but current tools don’t understand that.

    • This works fine in Cursor. As far as I know, you can't say "open the file..." but you can say "where is the function to do <xyz>" and it'll include a link to the file in its response, which you can then click to open.

Whilst the pendulum seems well on its way to swinging from microservices back to monoliths, I'm thinking we'll end up in a place that limits the volume and complexity of the code in a single service, so that it's just large enough to encompass a single responsibility.

Then we can easily drop in and out of using LLMs in the code space.

Service Oriented Architecture lends itself well to the limited context of these models.

Maybe we can revive literate programming and simply build everything from a single Markdown document…

  • Microservices lend themselves to architectural decisions that LLMs are just not trained to understand.

    It's one thing for it to be trained on billions of lines of code and be useful; it's another for it to have a quality dataset large enough to give it context and an understanding of something like Kafka partition ordering and its possible interactions with something like a database and at-least-once delivery. It will give you an explanation of those things in isolation, but not in combination.

Any unique benefits to using this vs something like Repomix? https://github.com/yamadashy/repomix

  • CodeWeaver is compiled, so it might be faster. Also, you can grab a compatible executable from the releases section and you're good to go, instead of using `go install`, so there are no dependencies. Personally, I considered following the `.gitignore` route but found that manually specifying what to exclude via a comma-separated list of regular expressions gave me the flexibility I needed (the initial setup might be a bit tedious, but then again, you can use an LLM for that).

I could see this being quite useful in the background for apps like Cursor when they need to perform a full codebase search. I imagine it could be more effective for breaking up larger codebases where embeddings start to fall short. If you could fit the entire document into context, you'd be able to "point the model" in the right direction.

The challenge is maintaining it... but you could maybe ask the model to do that incrementally on every commit, or just throw it away and regenerate it from scratch occasionally.

See the script I created that does something similar with a few improvements for large projects:

https://paste.mozilla.org/9rD95yAy

I would like to be able to create sets of files that I can easily send to the clipboard in this kind of format. The files could correspond to the ones relevant to a particular feature, etc. They don't always fall under the same subtree of the source code, and the entire source code is too big for the context.

Which is, like, kinda neat that it exists, but who's using tooling so bad that they're manually copying and pasting that much code into, what, a web browser text entry box?

Use better tools, people!

  • I have always used o1 pro and Deep Research, but these are only available through the web UI. There is no doubt that Cursor and others have a better UI, but the demand for this type of tool exists because OpenAI does not release an API for them.

Does anyone know of tools that go the other direction? i.e. taking a technical writeup (scientific paper, architecture docs, or similar) and emitting a candidate codebase.

  • Yes, I often use one LLM to generate a PRD and then include it in the codebase, then ask the Cursor agent to implement some part of the system using the PRD as a reference. It can't emit an entire codebase in one shot (unless it's a trivial project like "build me a flappy bird clone"), but you can use it as scaffolding to manage implementing a whole project in chunks.

  • I don’t know of a tool, but I’ve had some success doing this with a short one-shot prompt. I say something like, “Here’s a readme. Develop this in Go.”, followed by the readme.

    I’ve been getting complete working code with this strategy but I’m creating projects that are relatively simple.

    I also notice that I have to give a little deeper context about “how” it should work, which I normally wouldn’t do.

Given the limited context length of most LLMs, is there value in turning an entire codebase into a doc to feed into an LLM?

I think cherry-picking relevant sections would be necessary to make it function effectively. Has anyone tried using tree-sitter to recursively feed it the source for the functions used in the section you want to analyze, to optimize for this?

Interesting. I've been converting Jupyter notebooks into markdown for the same purpose. Am considering making a custom tool.

  • I also have this use case, and would be interested in such a tool. If you intend to write your tool in Golang, consider instead extending CodeWeaver.

If I'm reading this correctly, why include all the code in the Markdown? It seems like the AI model that would use this necessarily gets all the concatenated code plus the explanation of the code. I'm not sure which is better, since the LLM already has access to the entire code as part of the Markdown?

Oh, cool -- this is made with Golang! I'll have to see if I can wrap it in a desktop GUI using Wails.

I see lots of folks here using LLMs on their codebases. Does that mean there isn’t much concern about sharing your app’s code with an LLM? Have people just gotten comfortable with this now? Or does it only matter for closed-source or proprietary codebases?

  • You can run an LLM on your local machine, and you can get LLM sandboxes for your company.

Is this related to https://gitingest.com/ at all? Which seems to be a service doing a similar thing.

  • There are a ridiculous number of projects doing this.

    I'm always baffled by the response they get, since doing this is also the most impractical, poorly scaling way to insert an LLM into your development process.

    On the one hand, if you realize that, there may be times when you get lucky with the size of a codebase and the nature of your questions, and it works acceptably.

    But on the other, this feels like the kind of thing someone who's heard others rave about the utility of AI will try with too large a codebase, paste the result into ChatGPT, and then get an underperforming LLM because it's being flooded with irrelevant context for every basic operation it's asked to do.

    There are very few times when providing the entire codebase in the context window, instead of the code relevant to a single operation, makes sense.

This is great, but I’m pretty sure this is trivial using Emacs and org-mode. You could then use pandoc to convert org to Markdown.

  • It's trivial using a number of approaches, e.g. a simple bash or Python script. But I think there's still a fair amount of value in building a common tool for these sorts of things. Everyone who builds their own one-off solution will inevitably encounter more and more of the edge cases (oh, I need to honor .gitignore... oh, I need to be able to override .gitignore and include some ignored things... oh, I need to deal with huge files... etc.), and with a common tool, the tool can collect the ways of dealing with all of these edge cases.

    Now, no one will need something that can handle all of the edge cases, but whatever edge cases they do need handled will already be handled. The overall time and frustration saved this way can be huge.

How do you do the opposite of this? Transform your markdown files into a codebase that AI can't leech off of?

Damn, I did that the other day, but manually. I just cat'ed everything from a folder in the order I wanted and fed it to ChatGPT so it could write a README for tiny.js.

I built a simple tool to do something similar (it's meant for a monorepo and will bundle each subfolder into a subfolder-code.txt text file that you can upload to AIs).

https://github.com/manfrin/bundle-codebases

I don't see much merit in things like Markdown or syntax highlighting, as that's just extra noise for the AI. My script tries to cut down on any extraneous data, since the things I'm working on are near the context limit of consumer AIs.

My script also ignores anything in .gitignore and will take a .codebundlerwhitelist (I hate this name and have been meaning to change it) to only bundle files matching patterns you specify.

How does this compare to code2prompt or files2prompt? Any benchmarks on which one works better for LLMs?
