Comment by danielhanchen
2 days ago
For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good perf with the dynamic 2-bit quant (2-bit MoE, 6-8-bit rest); you can also do SSD offloading, but it'll be slow.
./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CPU"
More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1
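If you want to sanity-check whether your box clears that bar before downloading, a rough sketch (Linux-only; VRAM is counted only when `nvidia-smi` is available, and the 250GB threshold is the one quoted above):

```shell
# Rough RAM + VRAM tally (Linux; VRAM added only if nvidia-smi is present)
ram_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
vram_gb=0
if command -v nvidia-smi >/dev/null 2>&1; then
  vram_gb=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits \
    | awk '{s += $1} END {printf "%d", s / 1024}')
fi
echo "RAM + VRAM ~= $((ram_gb + vram_gb)) GB (want >= 250 for the dynamic 2-bit quant)"
```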
> More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1
Was that document almost exclusively written with LLMs? I looked at it last night (~8 hours ago) and it was riddled with mistakes, most egregious was that the "Run with Ollama" section had instructions for how to install Ollama, but then the shell commands were actually running llama.cpp, a mistake probably no human would make.
Do you have any plans on disclosing how much of these docs are written by humans vs not?
Regardless, thanks for the continued release of quants and weights :)
Oh hey, sorry, the docs are still under construction! Are you referring to merging GGUFs for Ollama? It should work fine, ie:
```
./llama.cpp/llama-gguf-split --merge \
  DeepSeek-V3.1-GGUF/DeepSeek-V3.1-UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
  merged_file.gguf
```
Ollama only accepts merged GGUFs (not split ones), hence the command.
All docs are made by humans (primarily my brother and me), just sometimes there might be some typos (sorry in advance)
I'm also uploading Ollama compatible versions directly so ollama run can work (it'll take a few more hours)
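For anyone following along, once you have the merged file, pointing Ollama at it looks roughly like this (the model name and path here are illustrative, not official Unsloth names):

```shell
# Minimal Modelfile referencing the merged GGUF produced above
cat > Modelfile <<'EOF'
FROM ./merged_file.gguf
EOF

# Register and run it (guarded so the snippet is a no-op without Ollama)
if command -v ollama >/dev/null 2>&1; then
  ollama create deepseek-v3.1-local -f Modelfile
  ollama run deepseek-v3.1-local
fi
```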
> but then the shell commands were actually running llama.cpp, a mistake probably no human would make.
But in the docs I see things like
Wouldn't this explain that? (Didn't look too deep)
Yes, it's probably the ordering of the docs that's the issue :) Ie https://docs.unsloth.ai/basics/deepseek-v3.1#run-in-llama.cp... does:
```
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
```
but then Ollama is above it:
```
./llama.cpp/llama-gguf-split --merge \
  DeepSeek-V3.1-GGUF/DeepSeek-V3.1-UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
  merged_file.gguf
```
I'll edit the area to say you first have to install llama.cpp
By the way, I'm wondering why unsloth (a goddamn python library) tries to run apt-get with sudo (and fails on my nixos). Like how tf are we supposed to use that?
Oh hey I'm assuming this is for conversion to GGUF after a finetune? If you need to quantize to GGUF Q4_K_M, we have to compile llama.cpp, hence apt-get and compiling llama.cpp within a Python shell.
There is a way to convert to Q8_0, BF16, or F16 without compiling llama.cpp; it's enabled if you use `FastModel` rather than `FastLanguageModel`.
Essentially I try `sudo apt-get`; if that fails, plain `apt-get`; and if that also fails, it just errors out. We need `build-essential cmake curl libcurl4-openssl-dev`.
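As a sketch of that fallback order (illustrative only, not the actual unsloth-zoo code):

```shell
# Pick the apt-get invocation: sudo first (when available and we're not
# already root), otherwise bare apt-get.
apt_cmd() {
  if [ "$(id -u)" -ne 0 ] && command -v sudo >/dev/null 2>&1; then
    echo "sudo apt-get"
  else
    echo "apt-get"
  fi
}

pkgs="build-essential cmake curl libcurl4-openssl-dev"
# Print rather than execute, so the sketch never mutates the system
echo "Would run: $(apt_cmd) install -y $pkgs"
```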
See https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
It seems Unsloth is useful and popular, and you seem responsive and helpful. I'd be down to try to improve this and maybe package Unsloth for Nix as well, if you're up for reviewing and answering questions; seems fun.
Imo it's best to just depend on the required fork of llama.cpp at build time (or not) according to some configuration. Installing things at runtime is nuts (especially if it means modifying the existing install path). But if you don't want to do that, I think this would also be an improvement:
Is either sort of change potentially agreeable enough that you'd be happy to review it?
I'll venture that whoever is going to fine-tune their own models probably already has llama.cpp installed somewhere, or can install if required.
Please, please, never silently attempt to mutate the state of my machine, that is not a good practice at all and will break things more often than it will help because you don't know how the machine is set up in the first place.
Dude, this is NEVER ok. What in the world??? A third party LIBRARY running sudo commands? That’s just insane.
You just fail and print a nice error message telling the user exactly what they need to do, including the exact apt command or whatever that they need to run.
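Something like this would do it, assuming a Debian/Ubuntu hint is acceptable (a sketch of the suggested fail-fast behavior, not unsloth's actual code):

```shell
# Check for required tools up front; on failure, print the exact command
# the user should run instead of mutating their system.
check_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "Missing build tools:$missing" >&2
    echo "On Debian/Ubuntu, run: sudo apt-get install -y build-essential cmake curl libcurl4-openssl-dev" >&2
    return 1
  fi
}

check_tools cmake curl gcc || true
```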
It won't work well if you're dealing with a non-Ubuntu + CUDA combination. Better to just fail with a reasonable message.
hey fellow crazy person! slight tangent: one thing that helps keep me grounded with "LLMs are doing much more than regurgitation" is watching them try to get things to work on nixos - and hitting every rake on the way to hell!
nixos is such a great way to expose code doing things it shouldn't be doing.
In my experience LLMs can do Nix very well, even the models I run locally. I just instruct them to pull dependencies through flake.nix and use direnv to run stuff.
I'm glad someone commented and tried it out - appreciate it immensely - I learnt a lot today :) I'm definitely gonna give nixos a spin as well!
Thanks for your great work with quants. I would really appreciate UD GGUFs for V3.1-Base (and even more so, GLM-4.5-Base + Air-Base).
Thanks! Oh base models? Interesting since I normally do only Instruct models - I can take a look though!
It’d also be great if you guys could do a fine tune to run on an 8x80G A/H100. These H200/B200 configs are harder to come by (and much more expensive).
Unsloth should work on any GPU setup all the way until the old Tesla T4s and the newer B200s :) We're working on a faster and better multi GPU version, but using accelerate / torchrun manually + Unsloth should work out of the box!
I guess I was hoping for you guys to put up these weights. I think they’d be popular for these very large models.
You guys already do a lot for the local LLM community and I appreciate it.
>250GB, how do you guys run this stuff?
I'm working on sub 165GB ones!
165GB will need a 24GB GPU + 141GB of RAM for reasonably fast inference or a Mac
For such a dynamic 2-bit quant, are there any benchmark results showing how much performance I would give up compared to the original model? Thanks.
Currently no, but I'm running them! Some people on the aider discord are running some benchmarks!
@danielhanchen do you publish the benchmarks you run anywhere?
if you are running a 2bit quant, you are not giving up performance but gaining 100% performance since the alternative is usually 0%. Smaller quants are for folks who won't be able to run anything at all, so you run the largest you can run relative to your hardware. I for instance often ran Q3_K_L, I don't think of how much performance I'm giving up, but rather how without Q3, I won't be able to run it at all. With that said, for R1, I did some tests against 2 public interfaces and my local Q3 crushed them. The problem with a lot of model providers is we can never be sure what they are serving up and could take shortcuts to maximize profit.
Oh Q3_K_L as in upcasted embed_tokens + lm_head to Q8_0? I normally do Q4 embed Q6 lm_head - would a Q8_0 be interesting?
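For reference, llama.cpp's `llama-quantize` exposes those per-tensor overrides directly via `--token-embedding-type` and `--output-tensor-type`; a dry-run sketch (file names are illustrative):

```shell
# Build (and just print) a llama-quantize invocation that upcasts the
# embedding and output head to Q8_0 while keeping the body at Q3_K_L.
cmd="./llama.cpp/llama-quantize --token-embedding-type q8_0 --output-tensor-type q8_0 model-BF16.gguf model-Q3_K_L.gguf Q3_K_L"
echo "$cmd"
```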
That's true only in a vacuum. For example, should I run gpt-oss-20b unquantized or gpt-oss-120b quantized? Some model families have a 70b/30b spread, and that's only within a single base model; many different models at different quants could be compared for different tasks.