Comment by giancarlostoro
3 days ago
One of the things I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers? I'm surprised something like Encyclopedia Britannica hasn't yet (afaik) tried to capitalize on AI by selling their data to LLMs and validating outputs for LLM companies. It would make a night and day difference in some areas, I would think. Wikipedia is nice, but there's so much room for human error and bias there.
Here's a short clip of Karpathy speaking on this subject.
https://youtu.be/UldqWmyUap4
Also this is the direction the small LLMs are moving in already. They are too small for general knowledge, but getting quite good at tool use (incl. Googling).
Now we just need them to be very strict about what they know and don't know! (I think this is still an open problem, even with big ones.)
It's not so much a "minimally viable LLM" but rather an LLM that knows natural language well but knows nothing else. Like me - as an engineer who knows how to troubleshoot in general but doesn't know about a specific device like my furnace (recent example).
And I don't think that LLM could just Google or check Wikipedia.
But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.
I asked this question a while back (the "only train w/ wikipedia LLM") and got pointed to the general-purpose "compression benchmarks" page: `https://www.mattmahoney.net/dc/text.html`
While I understand some of the fundamental thoughts behind that comparison, it's slightly wonky... I'm not asking "compress wikipedia really well", but instead "can a 'model' reason its way through wikipedia" (and what does that reasoning look like?).
Theoretically, with wikipedia-multi-lang you should be able to reasonably nail machine translation, but if everyone is starting with "only wikipedia", then how well can they keep up with the wild-web-trained models on the usual per-task performance bar charts?
If your particular training technique (using only wikipedia) can go from 60% of SOTA to 80% of SOTA on "Explain why 6-degrees of Kevin Bacon is relevant for tensor operations" (which is interesting to plug into Google's AI => Dive Deeper...), then that's a clue that it's not just throwing piles of data at the problem, but instead getting closer to extracting the deeper meaning (and/or reasoning!) that the data enables.
Correct! I know RAG is a thing, but I wish we could have "DLCs" for LLMs, the way image generation has LoRAs, which are cheaper to train than retraining the entire model and steer the output closer to what you want. I would love to pop in the CS "LoRA or DLC" and ask it about functional programming in Elixir, or whatever.
Maybe not crawl the web, but hit a service with pre-hosted, pre-curated content it can digest (and cache) that doesn't change very often. You aren't necessarily using it for the latest news; programming is mostly static knowledge, as a good example.
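To make the "digest (and cache)" part concrete, here is a minimal sketch of a time-based cache in front of a curated content service. Everything in it is an assumption for illustration: `CuratedCache`, the `fetch` callable, and the one-week default lifetime are hypothetical names and values, not any real service's API.

```python
import time


class CuratedCache:
    """Caches documents from a (hypothetical) curated content service."""

    def __init__(self, fetch, ttl_seconds=7 * 24 * 3600, clock=time.time):
        self._fetch = fetch      # callable: topic -> document text (assumed)
        self._ttl = ttl_seconds  # how long a cached document stays fresh
        self._clock = clock      # injectable clock, handy for testing
        self._entries = {}       # topic -> (time fetched, document text)

    def get(self, topic):
        now = self._clock()
        entry = self._entries.get(topic)
        # Serve from cache while the entry is younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise hit the service once and remember the result.
        content = self._fetch(topic)
        self._entries[topic] = (now, content)
        return content
```

Since the content rarely changes, a long lifetime means the model only goes to the service once per topic for most of its use.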
If I understand correctly, LoRA can be applied to LLMs.
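A rough sketch of why that is cheap, in plain Python with toy sizes: the base weight matrix W stays frozen and only two small matrices B and A are trained, so the effective weight is W + (alpha / r) * B A. The function names and the 4096 x 4096 layer size are illustrative assumptions, not any library's API.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tune params, LoRA params) for one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def apply_lora(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) without modifying W."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    full, lora = lora_param_counts(4096, 4096, rank=8)
    print(f"full: {full:,} params, LoRA: {lora:,} params")
```

For that assumed layer, the adapter trains 65,536 parameters instead of 16,777,216, which is the sense in which a swappable "DLC" per subject is plausible.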
Your worry about Wikipedia is that there is "much room for human error and bias", yet earlier you seem to imply that a LLM that has access to the www somehow would have less human error and bias? Personally, I'd see it the other way around.
When GPT 3.5 became a thing, it had crawled a very nuanced set of websites. This is what I mean: you basically curate where it sources data from.
Unfortunately reasoning ability depends on (or is enabled by) information intake during training. A model will know better what to search for and how to interpret it if the information was part of the training. So there is a trade off. Still I think the question is a practical one. Perhaps there are ideas to focus training on a) reasoning / conceptual modeling and b) reliance on external memory (search etc.) rather than internal memorization.
Isn’t that sort of what a RAG is? You’d need an LLM “smart” enough to turn natural-user prompts into searches, then some kind of search, then an LLM “smart” enough to summarize the results.
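Those three stages can be sketched in a few lines of Python. This is only a toy under loud assumptions: `rewrite_query` and `answer` are hypothetical stand-ins for the two LLM calls, the "search" is plain keyword overlap over a three-sentence corpus, and a real system would use an actual model and search index.

```python
import re

# Toy corpus standing in for a curated knowledge source (assumption).
DOCS = [
    "Elixir is a functional programming language that runs on the Erlang VM.",
    "Go is a statically typed, compiled programming language.",
    "LoRA fine-tunes a model by training small low-rank matrices.",
]

STOPWORDS = {"what", "is", "a", "the", "of", "in", "on", "that", "by",
             "me", "about", "tell"}


def tokenize(text):
    """Lowercase the text and return its non-stopword tokens as a set."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


def rewrite_query(prompt):
    """Stage 1 stand-in: an LLM would turn the prompt into search terms."""
    return tokenize(prompt)


def search(terms, docs, k=1):
    """Stage 2: rank documents by keyword overlap, dropping non-matches."""
    ranked = sorted(docs, key=lambda d: len(terms & tokenize(d)), reverse=True)
    return [d for d in ranked[:k] if terms & tokenize(d)]


def answer(prompt, docs):
    """Stage 3 stand-in: an LLM would summarize; here we quote the top hit."""
    hits = search(rewrite_query(prompt), docs)
    if not hits:
        return "I don't know."  # refuse instead of guessing
    return "According to the retrieved text: " + hits[0]


if __name__ == "__main__":
    print(answer("What is Elixir?", DOCS))
    print(answer("Tell me about mushrooms", DOCS))
```

The refusal branch is the easy part here because retrieval is explicit; getting a real model to admit "nothing relevant came back" is the hard part.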
Yeah, I think RAG is the idea that will lead us there, though it's a little complicated. For some subjects, say Computer Science, you need a little more than just "This is Hello World in Go": you might need to understand not just Go syntax on the fly, but also CS nuances that are not covered in one single simple document. The idea is to have a model that runs fully locally on a phone or laptop with minimal resources. On the other hand, I can also see smaller models talking to larger models that are cheaper to run in the cloud. I am wondering if this is the approach Apple might take with Siri, specifically in order to retain user privacy as much as possible.
I remember reading that hallucination is still a problem even with perfect context. You build a theoretical perfect RAG, give the LLM the exact correct information, and it will still make mistakes surprisingly often.
this was my experience as of about 6 months ago, and i don't believe that hallucinating is a solved problem as of yet
I feel like I should say "spoiler alert" but:
> I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers?
It depends what that word "reasonable" means for your specific use-case ;)
> validating outputs for LLM companies
How? They can validate thousands if not millions of queries, but nothing prevents the million-and-first from being a hallucination. People who would then pay extra for an "Encyclopedia Britannica validated LLM" would then, rightfully so IMHO, complain that "it" suggested they cook with a dangerous mushroom.
Wikipedia has proven to be as accurate as encyclopedias for decades now. Also, I'm betting AI companies have illegally trained their models on the Encyclopedia Britannica's data by now.
I think the idea is to train a small, minimal LLM thinking model that can run on edge devices, but that has very little knowledge embedded in its weights, and so performs a sort of RAG to Encyclopedia Britannica to ground answers to user queries.
Since Google Search already includes an AI summary, your minimally viable "LLM" can be just an HTTP GET call