Comment by fine_tune
2 days ago
RAG is taking a bunch of docs, chunking them into text blocks of a certain length (how best to do this is up for debate), and creating a search API that takes a query (like a Google search) and compares it to the document chunks (very much how you're describing). Take the returned chunks, ignore the score from the vector search, feed those chunks into a re-ranker along with the original query (this step is important; vector search mostly sucks), filter the re-ranked results down to the top 1-2, and then format a prompt like the one below (there's a rough code sketch of the whole flow after it):
The user asked 'long query'; we fetched some docs (see below). Answer the query based on the docs (reference the docs if you feel like it).
Doc1.pdf - Chunk N: Eat cheese
Doc2.pdf - Chunk Y: Don't eat cheese
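A minimal sketch of that retrieve, re-rank, and prompt-format flow, assuming sentence-transformers for the embeddings and a cross-encoder as the re-ranker (the model names and toy chunks are placeholders):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Hypothetical corpus: (source, chunk_text) pairs produced by whatever chunking you settle on.
chunks = [
    ("Doc1.pdf", "Eat cheese ..."),
    ("Doc2.pdf", "Don't eat cheese ..."),
]
chunk_vecs = embedder.encode([text for _, text in chunks], normalize_embeddings=True)

def search(query: str, k: int = 10) -> list[tuple[str, str]]:
    """Vector search: cosine similarity against every chunk, return the top-k."""
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]  # note: the raw vector scores get thrown away here

def answer_prompt(query: str) -> str:
    candidates = search(query)
    # Re-rank with a cross-encoder that sees query and chunk together, keep the top 1-2.
    rerank_scores = reranker.predict([(query, text) for _, text in candidates])
    best = sorted(zip(rerank_scores, candidates), key=lambda p: -p[0])[:2]
    context = "\n".join(f"{src} - {text}" for _, (src, text) in best)
    return (
        f"The user asked: '{query}'\n"
        "We fetched some docs (see below); answer the query based on the docs, "
        f"citing them where relevant.\n\n{context}"
    )

print(answer_prompt("Should I eat cheese?"))  # the prompt text you'd hand to the LLM
```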
You then expose the search API as a "tool" for the LLM to call, slightly reformatting the prompt above into a multi-turn convo, and suddenly you're in ze money.
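Exposing the search as a tool is mostly a schema plus a multi-turn message list. The shape below follows the common OpenAI-style function-calling convention; other providers use slightly different but equivalent formats:

```python
# Hypothetical tool definition for the search API from the sketch above.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the internal document chunks for passages relevant to a query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "A search query, like a Google search."},
            },
            "required": ["query"],
        },
    },
}

# The single-shot prompt becomes a multi-turn conversation: the model asks for a search,
# you run it, and you feed the chunks back as a tool result instead of pasting them
# into the first message.
messages = [
    {"role": "system", "content": "Answer using the search_docs tool; cite the chunks you use."},
    {"role": "user", "content": "Should I eat cheese?"},
    # assistant turn: tool call search_docs(query="eat cheese")
    # tool turn: "Doc1.pdf - Chunk N: Eat cheese\nDoc2.pdf - Chunk Y: Don't eat cheese"
    # assistant turn: final answer grounded in those chunks
]
```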
But once your users are happy with those results, they'll want something dumb like the latest football scores, and then you need a web tool - and then it never ends.
To be fair though, it's pretty powerful once you've got it in place.
Sorry for my lack of knowledge, but I've been wondering: what happens if you ask the RAG a question where the answer is not close in embedding space to the embedded question? Won't that limit the quality of the result? Or how does a RAG handle that? I guess maybe the multi-turn convo you mentioned helps in this regard?
The way I see RAG, it's basically some sort of semantic search, where the query needs to be similar to whatever you are searching for in the embedding space in order to get good results.
I think the trick is called "query expansion". You use an LLM to rewrite the query into a more verbose form, which can also include text from the chat context, and then you use that as the basis for the RAG lookup. Basically you use an LLM to give the RAG a better chance of having the query be similar to the resources.
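A sketch of that expansion step, assuming an OpenAI-style chat client; the model name and prompt wording are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-capable client works the same way

def expand_query(query: str, chat_history: list[str]) -> str:
    """Rewrite a terse user query into a verbose, self-contained search query."""
    history = "\n".join(chat_history[-5:])  # recent context only
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's question as a verbose, standalone search query. "
                "Resolve pronouns and vague references using the chat context. "
                "Return only the rewritten query."
            )},
            {"role": "user", "content": f"Chat context:\n{history}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content

# The expanded query is what you embed and send to the vector store, so the lookup
# has a better chance of landing near the right chunks.
search_query = expand_query("does it expire?", ["We were discussing the cheese in Doc1.pdf"])
```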
Or you find your users search for ID strings like k1231o to find reference docs, and you end up needing keyword search and re-ranking.
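One common fix there is hybrid retrieval: keep a plain keyword index (BM25) next to the vector store, merge the two rankings, and only then re-rank. A sketch, with rank_bm25 as an assumed dependency and a stubbed-out vector ranking standing in for the embedding search above:

```python
from rank_bm25 import BM25Okapi  # assumed keyword-scoring dependency

# Same hypothetical (source, text) chunks as in the earlier sketch.
chunks = [
    ("ref-k1231o.pdf", "k1231o reference doc for the widget flange"),
    ("Doc2.pdf", "Don't eat cheese"),
]
bm25 = BM25Okapi([text.lower().split() for _, text in chunks])

def keyword_rank(query: str, k: int = 10) -> list[int]:
    """Keyword ranking: catches exact ID strings like 'k1231o' that embeddings tend to miss."""
    scores = bm25.get_scores(query.lower().split())
    return sorted(range(len(chunks)), key=lambda i: -scores[i])[:k]

def vector_rank(query: str, k: int = 10) -> list[int]:
    # Stand-in for the embedding search from the earlier sketch, returning chunk indices.
    return list(range(min(k, len(chunks))))

def hybrid_rank(query: str, k: int = 10) -> list[int]:
    """Merge the two rankings with reciprocal rank fusion, then hand the winners to the re-ranker."""
    fused: dict[int, float] = {}
    for ranking in (keyword_rank(query, k), vector_rank(query, k)):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)  # 60 is the usual RRF constant
    return sorted(fused, key=fused.get, reverse=True)[:k]
```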
Is RAG how I would process my 20+ year old bug list for a piece of software I work on?
I've been thinking about this because it would be nice to have a fuzzier search.
Yes and no. For human search it's kinda neat: you might find some duplicates, or some nearest-neighbour bugs that help you solve a whole class of issues.
But the cool kids? They'd do something worse:
They'd define some complicated agentic setup that clones your code base into containers firewalled off from the world, and give prompts like:
You're an expert software dev in MY_FAVE_LANG. Here's a bug description: 'LONG BUG DESCRIPTION'. Explore the code and write a solution. Here are some tools (read_file, write_file, etc.)
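Under the hood each of those agents is just a loop around a tool-calling model. A rough sketch with hypothetical read_file/write_file tools; call_llm stands in for whatever provider SDK you use, and its exact return shape is an assumption here:

```python
import pathlib

REPO = pathlib.Path("/workspace/repo")  # the cloned, firewalled copy of the code base

def read_file(path: str) -> str:
    return (REPO / path).read_text()

def write_file(path: str, content: str) -> str:
    (REPO / path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

TOOLS = {"read_file": read_file, "write_file": write_file}

def run_agent(bug_description: str, call_llm, max_steps: int = 20) -> list[dict]:
    """Drive one agent: feed it the bug, execute requested tools, stop when it says it's done.

    `call_llm` is an assumed function that takes the message list and returns either
    {"tool": name, "args": {...}} or {"done": summary}; the real shape depends on your provider.
    """
    messages = [{
        "role": "user",
        "content": (
            "You're an expert software dev in MY_FAVE_LANG. Here's a bug description: "
            f"'{bug_description}'. Explore the code and write a solution. "
            "Tools available: read_file(path), write_file(path, content)."
        ),
    }]
    for _ in range(max_steps):
        action = call_llm(messages)
        if "done" in action:
            break
        messages.append({"role": "assistant", "content": str(action)})
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "name": action["tool"], "content": result})
    return messages
```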
You'd then spawn as many of these as you can per task and have them all generate pull requests. Review them with an LLM, then manually, and accept the PRs you want. Now you're in the ultra money.
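The fan-out itself is simple; a sketch assuming the run_agent function above and an LLM client of your choosing:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_bug(bug: str) -> dict:
    # One isolated clone per task; in practice this would be a per-task container checkout.
    transcript = run_agent(bug, call_llm=my_llm_client)  # my_llm_client is an assumed provider call
    return {"bug": bug, "transcript": transcript}        # turn this into a PR for review

bugs = ["crash on empty config", "login times out after 30s"]  # pulled from the bug list

with ThreadPoolExecutor(max_workers=8) as pool:
    candidate_fixes = list(pool.map(solve_bug, bugs))
# Review each candidate with an LLM first, then manually, and open PRs for the ones you accept.
```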
You'd use RAG to guide an untuned LLM on your code base for style and how to write code. You'd write docs like "how to write an API", "how to write a DB migration", etc., and give those as a tool to the agents writing the code.
With time and effort, you can write agents specific to your code base through fine-tuning, but who's got that kind of money?
I feel called out, lmao. I’m building an agentic framework for automated pentesting as part of an internal AppSec R&D initiative. My company’s letting me run wild with infrastructure and Bedrock usage (bless their optimism). I’ve been throwing together some admittedly questionable prototypes to see what sticks.
The setup is pretty basic: S3 for docs and code base, pgvector on RDS for embeddings, Claude/Titan for retrieval and reasoning. It works in the sense that data flows through and responses come out… but the agents themselves are kind of a mess.
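For anyone picturing the pgvector side of a setup like that, it's not much code; a sketch with psycopg, where the table layout, DSN, and embedding dimension are made-up examples:

```python
import psycopg  # psycopg 3; the pgvector extension is assumed to be enabled on the RDS instance

conn = psycopg.connect("postgresql://user:pass@my-rds-host/appsec", autocommit=True)  # placeholder DSN

# One-off schema: a chunks table with an embedding column sized to the embedding model in use.
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        source text,              -- e.g. the S3 key of the doc or source file
        body text,
        embedding vector(1024)    -- dimension depends on the Titan embedding model used
    )
""")

def top_chunks(query_embedding: list[float], k: int = 5) -> list[tuple[str, str]]:
    """Nearest-neighbour lookup by cosine distance; the results are what get fed to Claude."""
    rows = conn.execute(
        "SELECT source, body FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(query_embedding), k),
    ).fetchall()
    return rows
```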
They think they’ve found a bug, usually something like a permissive IAM policy or a questionable API call, and just latch onto it. They tunnel hard, write up something that sounds plausible, and stop there. No lateral exploration, no attempt to validate anything in a dev environment despite having MCP tools to access internal resources, and definitely no real exploitation logic.
I’ve tried giving them tools like CodeQL, semgrep and Joern, but that’s been pretty disappointing. They can run basic queries, but all they surface are noisy false positives, and they can’t reason their way out of why it might be a false positive early on. There’s no actual taint analysis or path tracing, just surface-level matching and overconfident summaries. I feel like I’m duct-taping GPT-4 to a security scanner and hoping for insight.
I’ve experimented with splitting agents into roles (finder, validator, PoC author, code auditor, super uber hacker man), giving them memory, injecting skepticism, etc., but it still feels like I’m missing something fundamental.
If cost isn’t an issue, how would you structure this differently? How do you actually get agents to do persistent, skeptical, multi-stage analysis, especially in security contexts where you need depth and proof, not just plausible-sounding guesses and long ass reports on false positives?
You'd be surprised how many people are actually doing this exact kind of solutioning.
It's also not that costly to do if you think about the problem correctly
If you continue down the brute forcing route you can do mischievous things like sign up for thousands and thousands of free accounts across numerous network connections to LLM APIs and plug away
You could try just exporting it as one text or XML file and seeing if it fits in Gemini's context
I don't think it will. Gemini Pro has a context window of 2 million tokens, which they say translates to around 1.5 million words. We have on the order of 100,000 logged issues, and a typical issue description is around 500 words, so that's roughly 50 million words, far more than would fit.