Comment by crispyambulance
1 year ago
Kind of embarrassed to ask: I use AI a lot, but I haven't really understood how the nuts and bolts work (other than at a 5th-grader, 30,000-foot level)...
So, when I use a "full" AI like ChatGPT-4o, I ask it questions and it has a firm grip on a vast amount of knowledge, like, whole-internet/search-engine scope knowledge.
If I run an AI "locally", on even a muscular server, it obviously does NOT have vast amounts of stored information about everything. So what use is it to run locally? Can I just talk to it as though it were a very smart person who, tragically, knows nothing?
I mean, I suppose I could point it to a NAS box full of PDFs and ask questions about that narrow range of knowledge, or maybe get one of those downloaded Wikipedia stores. Is that what folks are doing? It seems like you would really need a lot of content for the AI to even be remotely usable like the online versions.
Running it locally, it will still have that vast, "full Internet" knowledge.
This is probably one of the most confusing things about LLMs. They are not vast archives of information and the models do not contain petabytes of copied data.
This is also why LLMs are so often wrong. They work by association, not by recall.
Try one and find out. Look at the Quickstart section at https://github.com/Mozilla-Ocho/llamafile/: download a single cross-platform ~3.7GB file and execute it; it starts a local model and a local webserver, and you can query it.
See it demonstrated in a <7 minute video here: https://www.youtube.com/watch?v=d1Fnfvat6nM
The video explains that you can download the larger models on that GitHub page and use them with other command-line parameters. It also shows how to get a Windows + Nvidia setup to GPU-accelerate the model: install CUDA and MSVC / VS Community edition with the C++ tools, run it the first time from the MSVC x64 command prompt so it can build a cuBLAS component, then rerun it normally with the "-ngl 35" command-line parameter, which offloads 35 layers to the GPU (my card doesn't have much memory, so I can't offload everything).
GPU bits have changed! I just noticed in the video description:
"IMPORTANT: This video is obsolete as of December 26, 2023 GPU now works out of the box on Windows. You still need to pass the -ngl 35 flag, but you're no longer required to install CUDA/MSVC."
So that's convenient.
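Once the llamafile is running, it also exposes an HTTP API you can hit from a script instead of the browser chat page. A minimal Python sketch, assuming the server's default address (http://localhost:8080) and its OpenAI-style /v1/chat/completions endpoint:

    # Minimal sketch: query a locally running llamafile over HTTP.
    # Assumes the default server address and the OpenAI-compatible endpoint.
    import json
    import urllib.request

    payload = {
        "model": "local",  # the server answers with whatever model it loaded
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "What is quantization in the context of LLMs?"},
        ],
        "temperature": 0.7,
    }

    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())

    print(answer["choices"][0]["message"]["content"])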
The LLMs have the 'knowledge' baked in. One of the things you will hear about is quantized models with lower-precision weights (think 16-bit -> 4-bit), which lets them run on a greater variety of hardware and/or with better performance.
When you quantize, you sacrifice some model quality. In addition, a lot of the models favored for local use are already very small (7B or 3B parameters).
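To make the 16-bit -> 4-bit idea concrete, here is a toy Python sketch of round-to-fewer-levels quantization on a fake weight matrix. Real schemes (the GGUF k-quants, for example) use per-block scales and cleverer rounding; this only shows where the space savings and the quality loss come from:

    # Toy illustration of weight quantization (not a real GGUF/llama.cpp scheme).
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float16)  # fake layer weights

    def quantize_symmetric(weights, bits):
        """Map float weights onto a small set of evenly spaced integer levels."""
        levels = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit signed
        scale = np.abs(weights).max() / levels  # one scale for the whole tensor
        return np.round(weights / scale).astype(np.int8), scale

    def dequantize(q, scale):
        return q.astype(np.float16) * scale

    q4, scale = quantize_symmetric(w, bits=4)
    w_hat = dequantize(q4, scale)

    print("storage, fp16  :", w.nbytes / 1e6, "MB")
    print("storage, 4-bit :", w.size * 4 / 8 / 1e6, "MB (once packed)")
    print("mean abs error :", float(np.abs(w - w_hat).mean()))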
What OP is pointing out is that you can actually run the full DeepSeek R1 model, along with all of its 'knowledge', on relatively modest hardware.
Not many people want to make that tradeoff when there are cheap, performant APIs around, but for a lot of people who have privacy concerns or just like to tinker, it is a pretty big deal.
I am far removed from having a high performance computer (although I suppose my MacBook is nothing to sneeze at), but I remember building computers or homelabs back in the day and then being like ‘okay now what is the most stressful workload I can find?!’ — this is perfect for that.
I've also been away from tech (and the AI scene) for a few years now, and I mostly stayed away from LLMs. But I'm certain that all the content is baked into the model during training. When you query the model locally (since I suppose you don't train it yourself), you get all the knowledge that's baked into the model weights.
So I would assume the locally queried output to be comparable with the output you get from an online service (they probably use slightly better models; I don't think they release their latest ones to the public).
It's all in the model. If you look for a good definition of "intelligence", it is compression. You can see the ZIP algorithm as a primordial ancestor of ChatGPT :))
Most of an AI's knowledge is inside the weights, so when you run it locally, it has all that knowledge inside!
Some AI services allow the use of 'tools', and some of those tools can search the web, calculate numbers, reserve restaurants, etc. However, you'll typically see it doing that in the UI.
Local models can do that too, but it typically takes a bit more setup.
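As a rough illustration, here's a hand-rolled Python sketch against the same assumed local endpoint as above: the model is instructed to ask for a tool in plain text, the script runs the tool itself, and the result is fed back in a second request. The helper names are made up, and this deliberately doesn't rely on any server's native tool-calling API:

    # Hand-rolled "tool use" with a local model (sketch, hypothetical helpers).
    import json
    import urllib.request

    API = "http://localhost:8080/v1/chat/completions"  # assumed llamafile default

    def chat(messages):
        req = urllib.request.Request(
            API,
            data=json.dumps({"model": "local", "messages": messages}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]

    def web_search(query):
        # Placeholder tool; a real setup would call a search API here.
        return f"(pretend search results for: {query})"

    messages = [
        {"role": "system", "content": "If you need current information, reply with "
                                      "exactly one line: TOOL: web_search <query>"},
        {"role": "user", "content": "What happened in AI news this week?"},
    ]

    reply = chat(messages)
    if reply.strip().startswith("TOOL: web_search"):
        query = reply.split("TOOL: web_search", 1)[1].strip()
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": f"Tool result: {web_search(query)}"},
        ]
        reply = chat(messages)

    print(reply)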
LLMs are, to a good approximation, zip files intertwined with... magic... that allows the compressed data to be queried with plain old English - but you need to process all[0] the magic through some special matrix mincers together with the query (encoded as a matrix, too) to get an answer.
[0] not true but let's ignore that for a second
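If you want the matrix-mincer metaphor in runnable form, here's a toy Python sketch: the query becomes a matrix of token embeddings, gets multiplied through some (random, stand-in) weight matrices, and comes out as a probability distribution over the next token. Real models add attention and dozens of layers, but the shape of the computation is the same:

    # Toy "matrix mincer": query goes in as a matrix, next-token probabilities come out.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["the", "cat", "sat", "on", "mat", "dog"]
    d = 8                                   # tiny embedding dimension

    E = rng.normal(size=(len(vocab), d))    # token embeddings (learned weights)
    W = rng.normal(size=(d, d))             # stand-in for the transformer layers
    U = rng.normal(size=(d, len(vocab)))    # projection back to the vocabulary

    query = ["the", "cat", "sat"]
    X = E[[vocab.index(t) for t in query]]  # encode the query as a matrix

    hidden = np.tanh(X @ W)                 # "mince" it through the weights
    logits = hidden[-1] @ U                 # predict from the last position
    probs = np.exp(logits) / np.exp(logits).sum()

    for token, p in sorted(zip(vocab, probs), key=lambda x: -x[1]):
        print(f"{token:>4}  {p:.2f}")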
The knowledge is stored in the model. The one mentioned here is rather large; the full version needs over 700GB of disk space. Most people use compressed versions, but even those will often be 10-30GB in size.
THAT is rather shocking. Vastly smaller than I would expect.
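The arithmetic behind those numbers is roughly parameters times bits per weight, plus a bit of overhead for metadata and the tokenizer. A quick Python back-of-the-envelope, using commonly quoted (approximate) parameter counts:

    # Back-of-the-envelope disk size: parameters x bits per weight / 8.
    def model_size_gb(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for name, params in [("7B model", 7), ("70B model", 70), ("DeepSeek R1, 671B", 671)]:
        for bits in (16, 8, 4):
            print(f"{name:>18} @ {bits:>2}-bit: ~{model_size_gb(params, bits):7.1f} GB")

That's why a 4-bit 7B model lands in the single-digit-GB range while the full R1 weights need hundreds of GB.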
I ask this all the time. Locally running an LLM seems super hobbyist to me, like a tweaking-terminal-font-sizes-on-fringe-BSD-distros kind of thing.
Privacy.