
Comment by jellyotsiro

10 hours ago

On the limited dataset: Completely agree - the public files are a fraction of what exists, and I should have been clearer that this covers not all files but all publicly available ones. But that's exactly why making even this subset searchable matters. The bar right now is people manually ctrl+F-ing through PDFs or relying on secondhand claims. This at least lets anyone verify what is public.

On LLMs vs traditional NLP: I hear you, and I've seen similar issues with LLM hallucination on structured data. That's why the architecture here is hybrid:

- Traditional exact regex/grep search for names, dates, and identifiers
- Vector search for semantic queries
- An LLM orchestration layer that must cite sources and can't generate answers without grounding
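To make the hybrid claim concrete, here is a minimal sketch of how the two retrieval passes plus a refuse-without-sources guard might fit together. Everything here is illustrative: the corpus, the function names, and the bag-of-words cosine standing in for real embeddings are all my own assumptions, not the project's actual code.

```python
import re
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the indexed document set.
DOCS = {
    "doc1": "Flight log dated 2002-03-14 lists two passengers.",
    "doc2": "Deposition transcript discussing property records in 2005.",
}

def exact_search(pattern):
    """Regex/grep-style pass for names, dates, identifiers."""
    rx = re.compile(pattern)
    return [doc_id for doc_id, text in DOCS.items() if rx.search(text)]

def _vec(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def vector_search(query, k=1):
    """Toy semantic pass: bag-of-words cosine in place of embeddings."""
    q = _vec(query)
    ranked = sorted(DOCS, key=lambda d: _cosine(q, _vec(DOCS[d])),
                    reverse=True)
    return ranked[:k]

def answer(query):
    """Orchestration layer: refuse to answer when retrieval finds nothing."""
    hits = exact_search(re.escape(query)) or vector_search(query)
    if not hits:
        return {"answer": None, "sources": []}
    # A real system would call the LLM here with `hits` as context;
    # the sources are attached from retrieval, not generated by the model.
    return {"answer": f"See {', '.join(hits)}", "sources": hits}
```

The guard in `answer` is the easy half: it can enforce "no answer without retrieved sources", but, as the reply below notes, it cannot by itself enforce that the generated text is derived from those sources.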

> can't generate answers without grounding

"can't" seems like quite a strong claim. Would you care to elaborate?

I can see how one might use a JSON schema that enforces source references in the output, but I'm not aware of any technique that constrains a model to draw only on the grounding documents, as opposed to producing a response from its pretraining data (or hallucinating one) and still listing the provided RAG results as references.
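The gap being described can be shown in a few lines. A schema can force the model to emit a `citations` field, and a post-hoc check can verify each quoted span really occurs in the cited document, but neither step proves the answer itself was derived from those documents. This is a hypothetical validator of my own, not anything from the project:

```python
# Hypothetical retrieved context for one query.
RETRIEVED = {
    "doc1": "The meeting took place on 14 March 2002 in New York.",
}

def citations_verbatim(response):
    """Check every cited quote appears verbatim in the cited document.

    This enforces well-formed citations, not groundedness: the answer
    text itself is never compared against the sources.
    """
    return all(
        c["quote"] in RETRIEVED.get(c["doc_id"], "")
        for c in response["citations"]
    )

# Passes: quote is real and the answer matches it.
ok = {"answer": "The meeting was in New York.",
      "citations": [{"doc_id": "doc1", "quote": "in New York"}]}

# Fails: the quote was fabricated.
fake_quote = {"answer": "The meeting was in Paris.",
              "citations": [{"doc_id": "doc1", "quote": "in Paris"}]}

# Passes the check, yet the answer is hallucinated: a genuine quote is
# attached to a claim the model made up. This is exactly the failure
# mode a "must cite sources" schema cannot rule out.
laundered = {"answer": "The meeting was in Paris.",
             "citations": [{"doc_id": "doc1", "quote": "in New York"}]}
```

The `laundered` case is the crux: every structural constraint is satisfied while the claim itself is ungrounded, which is why "can't" overstates what schema enforcement buys you.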

It feels like your "can't" amounts to claiming you've single-handedly solved the problem of hallucinations, which, if you had, would be a billion-dollar-plus unlock for you, so I'm not sure you should show that level of certainty.