Comment by danielhanchen

6 months ago

For GPT OSS in particular, OpenAI only released the MoEs in MXFP4 (4bit), so the "unquantized" version is 4bit MoE + 16bit attention - I uploaded "16bit" versions to https://huggingface.co/unsloth/gpt-oss-120b-GGUF, and they use 65.6GB whilst MXFP4 uses 63GB, so it's not that much difference - same with GPT OSS 20B

llama.cpp also unfortunately cannot quantize matrices that are not a multiple of 256 (2880)