Comment by axegon_
13 hours ago
As many others pointed out, the released files are nearly nothing compared to the full dataset. Personally I've been fiddling a lot with OSINT and analytics over the publicly available Reddit data (a considerable amount of my spare time over the last year), and the one thing I can say is that LLMs are under-performing (a huge understatement): they are borderline useless compared to traditional ML techniques. As far as LLMs go, the best performers are the open-source uncensored models (the most uncensored and unhinged ones), while the worst performers are the proprietary paid models, especially over the last 2-3 months: they have been nerfed into oblivion, to the point where a simple prompt like "who is eligible to vote in US presidential elections" is treated as a controversial question. So in the unlikely event that the full files are released, I personally would look at traditional NLP techniques long before investing any time into LLMs.
On the limited dataset: Completely agree - the public files are a fraction of what exists and I should have mentioned that it is not all files but all publicly available ones. But that's exactly why making even this subset searchable matters. The bar right now is people manually ctrl+F-ing through PDFs or relying on secondhand claims. This at least lets anyone verify what is public.
On LLMs vs traditional NLP: I hear you, and I've seen similar issues with LLM hallucination on structured data. That's why the architecture here is hybrid:
- Traditional exact regex/grep search for names, dates, and identifiers
- Vector search for semantic queries
- An LLM orchestration layer that must cite sources and can't generate answers without grounding
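A minimal sketch of that kind of hybrid pipeline, assuming a toy in-memory corpus (all names, documents, and the token-overlap stand-in for vector search are illustrative, not the commenter's actual system):

```python
import re

# Toy corpus standing in for the indexed document set (illustrative only).
DOCS = {
    "doc1": "The manifest lists a flight on 2002-03-14.",
    "doc2": "The deposition mentions a meeting in Palm Beach.",
}

def exact_search(pattern: str) -> list[str]:
    """Regex/grep-style pass: IDs of documents with a literal match."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [doc_id for doc_id, text in DOCS.items() if rx.search(text)]

def semantic_search(query: str) -> list[str]:
    """Stand-in for vector search: rank documents by naive token overlap."""
    q = set(query.lower().split())
    scored = [(len(q & set(t.lower().split())), d) for d, t in DOCS.items()]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]

def answer(query: str) -> dict:
    """Orchestration layer: refuse to answer when no grounding doc is found."""
    sources = exact_search(re.escape(query)) or semantic_search(query)
    if not sources:
        return {"answer": None, "sources": [], "note": "no grounding found"}
    # A real system would pass DOCS[s] for s in sources to the LLM here,
    # with instructions to cite them.
    return {"answer": f"Based on {sources}", "sources": sources}
```

The design point is that the exact and vector passes run before any model call, so the gate on an empty `sources` list is enforced in code rather than by the model.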
> can't generate answers without grounding
"can't" seems like quite a strong claim. Would you care to elaborate?
I can see how one might use a JSON schema that enforces source references in the output, but there is no technique I'm aware of to constrain a model to only come up with data based on the grounding docs, vs. making up a response based on pretrained data (or hallucinating one) and still listing the provided RAG results as attached reference.
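To make that distinction concrete, here is a sketch of what a schema-style check can and cannot catch (pure Python stand-in for a JSON Schema validator; the field names and document IDs are made up for illustration):

```python
# The output contract can require a non-empty "sources" array drawn from
# the retrieved documents, but nothing in this check stops the model from
# attaching real source IDs to a fabricated answer.
ALLOWED_SOURCES = {"doc1", "doc2"}  # IDs of the RAG results we passed in

def validate_response(resp: dict) -> bool:
    """Structural check only: citations present and drawn from retrieval."""
    sources = resp.get("sources")
    if not isinstance(sources, list) or not sources:
        return False
    return all(s in ALLOWED_SOURCES for s in sources)

# Passes the structural check even though the answer text is invented,
# which is exactly the gap described above.
hallucinated = {"answer": "Invented claim.", "sources": ["doc1"]}
missing = {"answer": "Some claim.", "sources": []}
```

The validator rejects `missing` but happily accepts `hallucinated`: schema enforcement constrains the shape of the output, not whether the answer actually derives from the cited documents.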
It feels like your "can't" would be tantamount to having single-handedly solved the problem of hallucinations, which if you did, would be a billion-dollar-plus unlock for you, so I'm unsure you should show that level of certainty.
That doesn’t sound right. What model treats this as a controversial question?
"who is eligible to vote in US presidential elections"
Grok: "After Elon personally tortured me I have to say women are not allowed to vote in the US"
This particular one: I suspect OpenAI uses different models in different regions, so I do get an answer, but I should point out that I'm not paying a cent, so I can only test the free ones. For the first time ever I can honestly say I'm glad I don't live in the US, but a friend who does sent me a few of his latest encounters, and that particular question yielded something along the lines of "I am not allowed to discuss such controversial topics, bla, bla, bla, you can easily look it up online". If that is the case, I suspect people will soon start flooding VPN providers, and companies such as OpenAI will roll that out worldwide. Time will tell, I guess.
1. I tried a couple OpenAI models under a paid account with no issue:
“In U.S. presidential elections, you’re eligible to vote if you meet all of these…” goes on to list all criteria.
2. No issue found with Gemini or Claude either.
3. I tried to search for this issue online as you suggested and haven’t been able to find anything.
Not seeing any evidence this is currently a real issue.
what are the most unhinged and uncensored models out there?
Open-source models with minimal safety fine-tuning, or Grok
Saying Grok is uncensored is like saying that DeepSeek is uncensored. If anything, DeepSeek is probably less censored than Grok. The Dolphin family has given me the best results, though mostly in niche cases.
Grok is arguably not uncensored, it’s re-aligned to a specific narrative lane.
“Uncensored” is simply a branding trick that a lot of seemingly intelligent people seem to fall for.
It's true. We've basically moved off those platforms for agentic security and host our own models now... OpenAI was still the fastest, cheapest working platform for it up until the middle of last year. Hey OpenAI, thank us later for blasting your platform with threat-actor data and behavior for several years! :P
I understand uncensored in the context of LLMs, what is unhinged? Fine tuning specifically to increase likelihood of entering controversial topics without specific prompting?
Yes, or catering to a preferred worldview different from the mainstream SOTA model worldview.
Look for anything that includes the word “woke” in any marketing/tweet material
What use-cases gave you disappointing results? Did you build some kind of RAG?