Comment by refibrillator
2 months ago
> Unsloth Dynamic GGUF which, quality wise in real-world use performs very close to the original
How close are we talking?
I’m not calling you a liar OP, but in general I wish people perpetuating such broad claims would be more rigorous.
Unsloth does amazing work, however as far as I’m aware even they themselves do not publish head to head evals with the original unquantized models.
I have sympathy here because very few people and companies can afford to run the original models, let alone engineer rigorous evals.
However I felt compelled to comment because my experience doesn't match this. For relatively simple usage the differences are hard to notice, but they become much more apparent in high-complexity, long-context tasks.
Oh hey :) Thanks for the kind words - we did publish benchmarks (MMLU, KLD, perplexity) for Llama 4 Scout and Gemma 3 27B using our methodology - see https://news.ycombinator.com/item?id=39671146 etc. Of those, KLD against the original model is the one we consider much more important :)
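If you want to sanity-check a quant on your own data, llama.cpp's perplexity tool can compute KLD against a higher-precision baseline - roughly like this (paths and quant names are just placeholders, and exact flags can differ between llama.cpp versions, so check `--help`):

```bash
# 1) Save baseline logits from a higher-precision model (e.g. BF16 or Q8_0)
./llama-perplexity -m model-BF16.gguf -f calibration.txt \
  --kl-divergence-base baseline-logits.dat

# 2) Run the quantized model against that baseline to get KLD and related stats
./llama-perplexity -m model-UD-Q2_K_XL.gguf -f calibration.txt \
  --kl-divergence-base baseline-logits.dat --kl-divergence
```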
We also provide Q8_0 and Q8_K_XL quants, which are mostly equivalent to FP8 - you can also use the magical `-ot ".ffn_.*_exps.=CPU"` incantation to offload MoE layers to RAM!
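For example, a full invocation could look something like this (model path, `-ngl` and context size are just placeholders - adjust for your hardware):

```bash
# Keep the dense/attention layers on the GPU, push the MoE expert tensors to system RAM
./llama-server -m DeepSeek-R1-UD-Q2_K_XL.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384
```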
> All Distilled and the original R1 versions seem to have accidentally assigned the padding token to <|endofsentence|>, which is mostly not a good idea, especially if you want to further finetune on top of these reasoning models. This will cause endless infinite generations, since most frameworks will mask the EOS token out as -100.
I couldn't tell if this was an error in the code running the model or in the model weights themselves; assuming it's the former, are these fixes being upstreamed anywhere?
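For what it's worth, you can at least check whether a particular checkpoint ships with the problem by peeking at its tokenizer config (the model directory here is just an example):

```bash
# See which tokens the HF config assigns as pad/eos for a downloaded checkpoint
grep -E '"(pad|eos)_token"' DeepSeek-R1-Distill-Llama-8B/tokenizer_config.json
```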
You are right that I haven't been rigorous - it's easy to benchmark tokens/second, but quality of output is much harder to pin down, and I couldn't find any decent comparisons for Unsloth either. So I just tried a few of their models out, looking for something that was 'good enough', i.e. does all I need: coding, summarizing documents, and troubleshooting anything and everything. I would like to see head-to-head comparisons too - maybe I will invest in more RAM at some stage, but so far I have no need for it.

I ran some comparisons between the smaller and larger versions of the Unsloth models and, interestingly (for me anyway), didn't notice a huge difference in quality between them. But the smaller models didn't run significantly faster, so I settled for the biggest model I could fit in RAM with a decent context.

For more complex coding I use DeepSeek R1 (again the Unsloth quant), but since it's a reasoning model it isn't real-time, so it's no use as my daily driver.
Thanks for using our quants - appreciate it :) We're still running more thorough benchmarks since they're very slow to do, but these quants definitely pass our internal ones :)
Thank you for making the dynamic quantisations! My setup wouldn't be possible without them and for my personal use, they do exactly what I need and are indeed excellent.
How do you find the quality of the output compares to that of, say, o3 or Sonnet 4?
To be honest I haven't used o3 or Sonnet, as the code I work with is my own proprietary code that I like to keep private - which is one reason for the local setup. For troubleshooting day-to-day things I have found it at least as good as the free in-browser version of ChatGPT (I'm not sure which model that uses).