solarkraft 3 months ago

Looks like it: https://ollama.com/library/qwen3-vl:30b-a3b
thot_experiment 3 months ago

FWIW, on my machine it is about 1.5x faster to run inference in llama.cpp. These are the settings I use for the Qwen model I keep permanently in VRAM:

  llama-server --host 0.0.0.0 --model Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf --mmproj qwen3-VL-mmproj-F16.gguf --port 8080 --jinja --temp 0.7 --top-k 20 --top-p 0.8 -ngl 99 -c 65536 --repeat_penalty 1.0 --presence_penalty 1.5
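Once a server like that is up, it exposes an OpenAI-compatible HTTP API, so a client can send images to the vision model. Below is a minimal sketch assuming the host/port from the command above (localhost:8080), the `openai` Python package, and a placeholder image path and model name; adjust to your own setup.

  # Minimal client sketch against llama-server's OpenAI-compatible endpoint.
  # Assumes the server from the command above is listening on localhost:8080;
  # "example.png" and the model name are placeholders.
  import base64
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

  # Encode a local image for the vision model (decoded via the --mmproj projector).
  with open("example.png", "rb") as f:
      image_b64 = base64.b64encode(f.read()).decode()

  response = client.chat.completions.create(
      model="qwen3-vl",  # llama-server generally serves whatever model it was started with
      messages=[{
          "role": "user",
          "content": [
              {"type": "text", "text": "Describe this image."},
              {"type": "image_url",
               "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
          ],
      }],
      temperature=0.7,
  )
  print(response.choices[0].message.content)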