Comment by jodrellblank
1 year ago
Try one and find out. Look at the Quickstart section of https://github.com/Mozilla-Ocho/llamafile/ — you download a single cross-platform ~3.7GB file and execute it; it starts a local model plus a local webserver, and you can query it.
See it demonstrated in a <7 minute video here: https://www.youtube.com/watch?v=d1Fnfvat6nM
The video explains that you can download the larger models listed on that GitHub page and use them with other command-line parameters. It also shows how to get GPU acceleration on a Windows + NVIDIA setup: install CUDA and MSVC / VS Community edition with the C++ tools, run the llamafile for the first time from the MSVC x64 command prompt so it can build a cuBLAS component, then rerun it normally with the "-ngl 35" parameter to offload 35 layers to the GPU (which fit in the ~3.5GB of GPU memory my card has — it doesn't have much).
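For reference, the quickstart boils down to something like this (the file name here is illustrative — grab the actual download link from the repo's Quickstart section):

```shell
# Hypothetical file name for illustration; use the real download link
# from the llamafile Quickstart section.
chmod +x llava-v1.5-7b-q4.llamafile   # mark the downloaded file executable
./llava-v1.5-7b-q4.llamafile          # starts the model and a local web server

# With an NVIDIA GPU, offload 35 layers to GPU memory:
./llava-v1.5-7b-q4.llamafile -ngl 35
```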
GPU bits have changed! I just noticed in the video description:
"IMPORTANT: This video is obsolete as of December 26, 2023 GPU now works out of the box on Windows. You still need to pass the -ngl 35 flag, but you're no longer required to install CUDA/MSVC."
So that's convenient.