
Comment by bluejay2387

22 days ago

So about a year ago I wrote my own attempt at something like this using vector indexing and BM25 (the latest version uses CocoIndex; before that I had a custom-coded solution using ChromaDB). I wrote a reasonably comprehensive test set, and it showed better search result quality and lower token usage than grep and rg. I haven't had time to really polish it, but it worked well enough, particularly for one project where I have around 250k documentation files and docs outnumber code files 1000 to 1 (about a 50% reduction in tokens and a 30% increase in successful searches). Yesterday, for grins, I tried this project and was fairly disappointed to see it blow away my kludged solution, particularly given that it doesn't have a lengthy indexing process. I haven't tested it on the 250k doc project yet, but in another project where I have a test suite for semantic search, it outperformed my solution by about 20% in successful search results, even on documentation (which I didn't expect, given that it seems to be tuned only for code). I haven't gone through the code to see what it's doing differently from what I tried, but whatever it's doing, it seems to have potential.

Wow, thanks for sharing, and cool that you're working on similar things! Feel free to drop any feedback on the repo if you want!