Comment by danielhanchen
17 hours ago
Oh good idea! In general, UD-Q4_K_XL (Unsloth Dynamic 4-bit Extra Large) is what I recommend for most hardware. MXFP4_MOE is also OK.
Is there some indication of how the different bit quantizations affect performance? I.e., I have a 5090 + 96GB, so I want to get the best possible model, but I don't care about getting 2% better perf if I only get 5 tok/s.
It takes download time + 1 minute to test speed yourself, so you can try different quants. It's hard to write down a table because the numbers depend on your system (RAM clock etc.) once the model no longer fits on the GPU.
I guess it would make sense to have something like max context size/quants that fit fully on common configs: single GPUs, dual GPUs, unified RAM on Mac, etc.
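For what it's worth, a rough version of that "does it fit" check can be sketched in a few lines. This is only a back-of-the-envelope estimate, not anything from Unsloth or llama.cpp: it assumes a plain attention model with an unquantized fp16 KV cache, and the function names, the 1.5 GiB overhead figure, and the model dimensions in the usage note are all hypothetical placeholders. Real usage depends on the runtime, the architecture (MoE, hybrid attention), and KV cache quantization.

```python
GIB = 2**30

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size: K and V each hold
    n_kv_heads * head_dim values per token per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

def fits_on_gpu(model_file_gib, vram_gib, n_layers, n_kv_heads, head_dim,
                n_ctx, overhead_gib=1.5):
    """True if weights + KV cache + a guessed fixed overhead fit in VRAM.
    overhead_gib is an assumption, not a measured value."""
    kv_gib = kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx) / GIB
    return model_file_gib + kv_gib + overhead_gib <= vram_gib

def max_context(model_file_gib, vram_gib, n_layers, n_kv_heads, head_dim,
                overhead_gib=1.5):
    """Largest context (in tokens) whose KV cache fits in the leftover VRAM."""
    budget_gib = vram_gib - model_file_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx=1)
    return int(budget_gib * GIB // per_token)
```

With made-up numbers (a 20 GiB quant file, 32 GiB of VRAM, 32 layers, 8 KV heads, head dim 128), `fits_on_gpu(20, 32, 32, 8, 128, 4096)` says whether a 4096-token context fits, and `max_context(20, 32, 32, 8, 128)` gives the rough ceiling. Swap in the real file size of each quant and the model's actual dimensions to compare them.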
Testing speed is easy, yes. I'm mostly wondering about the quality difference between Q6 and Q8_K_XL, for example.