Comment by giancarlostoro

4 months ago

One of the things I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers? I'm surprised something like Encyclopedia Britanica hasn't yet (afaik) tried to capitalize on AI by selling their data to LLMs and validating outputs for LLM companies, it would make a night and day difference in some areas I would think. Wikipedia is nice, but there's so much room for human error and bias there.

18 comments

giancarlostoro

andai 4 months ago

Here's a short clip of Karpathy speaking on this subject.

https://youtu.be/UldqWmyUap4

Also this is the direction the small LLMs are moving in already. They are too small for general knowledge, but getting quite good at tool use (incl. Googling).

Now we just need them to be very strict about what they know and don't know! (I think this is still an open problem, even with big ones.)

intrasight 4 months ago

It's not so much a "minimally viable LLM" but rather an LLM that knows natural language well but knows nothing else. Like me - as an engineer who knows how to troubleshoot in general but doesn't know about a specific device like my furnace (recent example).

And I don't think that LLM could just Google or check Wikipedia.

But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.

ramses0 4 months ago

I asked this question a while back (the "only train w/ wikipedia LLM") and got pointed to the general-purpose "compression benchmarks" page: `https://www.mattmahoney.net/dc/text.html`
While I understand some of the fundamental thoughts behind that comparison, it's slightly wonky... I'm not asking "compress wikipedia really well", but instead "can a 'model' reason its way through wikipedia" (and what does that reasoning look like?).
Theoretically with wikipedia-multi-lang you should be able to reasonably nail machine-translation, but if everyone is starting with "only wikipedia" then how well can they keep up with the wild-web-trained models on similar bar chart per task performance?
If your particular training technique (using only wikipedia) can go from 60% of SOTA to 80% of SOTA on "Explain why 6-degrees of Kevin Bacon is relevant for tensor operations" (which is interesting to plug into Google's AI => Dive Deeper...), then that's a clue that it's not just throwing piles of data at the problem, but instead getting closer to extracting the deeper meaning (and/or reasoning!) that the data enables.
giancarlostoro 4 months ago
Correct! I know RAG is a thing, but I wish we could have "DLCs" for LLMs like image generation has LoRa's which are cheaper to train for than retraining the entire model, and provide more output like what you want. I would love to pop in the CS "LoRa or DLC" and ask it about functional programming in Elixir, or whatever.
Maybe not crawl the web, but hit a service with pre-hosted, precurated content it can digest (and cache) that doesn't necessarily change often enough. You aren't using it for the latest news necessarily, but programming is mostly static knowledge a a good example.
- dpflug 4 months ago
  
  If I understand correctly, LoRa can be applied to LLMs

embedding-shape 4 months ago

Your worry about Wikipedia is that there is "much room for human error and bias", yet earlier you seem to imply that a LLM that has access to the www somehow would have less human error and bias? Personally, I'd see it the other way around.

giancarlostoro 4 months ago

When GPT 3.5 became a thing, it had crawled a very nuanced set of websites, this is what I mean. You basically curate where it sources data from.

bee_rider 4 months ago

Isn’t that sort of what a RAG is? You’d need an LLM “smart” enough to turn natural-user prompts into searches, then some kind of search, then an LLM “smart” though to summarize the results.

giancarlostoro 4 months ago

Yeah, I think RAG is the idea that will lead us there, though its a little complicated, because for some subjects, say Computer Science, you need a little more than just "This is Hello World in Go" you might need to understand not just Go syntax on the fly, but more CS nuances that are not covered in one single simple document. The idea being having a model that runs fully locally on a phone or laptop with minimal resources. On the other hand, I can also see smaller models talking to larger models that are cheaper to run in the cloud. I am wondering if this is the approach Apple might take with Siri, specifically in order to retain user privacy as much as possible.
andai 4 months ago
I remember reading tht hallucination is still a problem even with perfect context. You build a theoretical perfect RAG, give the LLM the exact correct information, and it will still make mistakes surprisingly often.
- Natfan 4 months ago
  
  this was my experience as of about 6 months ago, and i don't believe that hallucinating is a solved problem as of yet

krychu 4 months ago

Unfortunately reasoning ability depends on (or is enabled by) information intake during training. A model will know better what to search for and how to interpret it if the information was part of the training. So there is a trade off. Still I think the question is a practical one. Perhaps there are ideas to focus training on a) reasoning / conceptual modeling and b) reliance on external memory (search etc.) rather than internal memorization.

utopiah 4 months ago

> validating outputs for LLM companies

How? They can validate thousands if not millions of queries but nothing prevent the millions-th-and-one from being a hallucination. People who would then pay extra for a "Encyclopedia Britanica validated LLM" would then, rightfully so IMHO, complain that "it" suggested them to cook with a dangerous mushroom.

rablackburn 4 months ago

I feel like I should say "spoiler alert" but:

> I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers?

It depends what that word "reasonable" means for your specific use-case ;)

thinkingtoilet 4 months ago

Wikipedia has proven to be as accurate as encyclopedias for decades now. Also, I'm betting AI companies have illegally trained their models on the Encyclopedia Britanica's data by now.

davidron 4 months ago

It's perfectly legal to train a human on copyrighted work and I think, depending on the country, it's not settled that training ai on the same data is illegal.
naasking 4 months ago

I think the idea is to train a small, minimal LLM thinking model that can run on edge devices, but that has very little knowledge embedded in its weights, and so performs a sort of RAG to Encylopedia Britannica to ground answers to user queries.

uniq7 4 months ago

Since Google Search already includes an AI summary, your minimally viable "LLM" can be just an HTTP GET call