solarkraft 3 months ago

Looks like it: https://ollama.com/library/qwen3-vl:30b-a3b
thot_experiment 3 months ago

FWIW, on my machine it is about 1.5x faster to run inference in llama.cpp. These are the settings I use for the Qwen model I keep permanently in VRAM:

  llama-server --host 0.0.0.0 --model Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf --mmproj qwen3-VL-mmproj-F16.gguf --port 8080 --jinja --temp 0.7 --top-k 20 --top-p 0.8 -ngl 99 -c 65536 --repeat_penalty 1.0 --presence_penalty 1.5
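Once a server like that is up, it exposes an OpenAI-compatible HTTP API, so a client can send images to the vision model. Below is a minimal sketch assuming the host/port from the command above (localhost:8080), the `openai` Python package, and a placeholder image path and model name; adjust to your own setup.

  # Minimal client sketch against llama-server's OpenAI-compatible endpoint.
  # Assumes the server from the command above is listening on localhost:8080;
  # "example.png" and the model name are placeholders.
  import base64
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

  # Encode a local image for the vision model (decoded via the --mmproj projector).
  with open("example.png", "rb") as f:
      image_b64 = base64.b64encode(f.read()).decode()

  response = client.chat.completions.create(
      model="qwen3-vl",  # llama-server generally serves whatever model it was started with
      messages=[{
          "role": "user",
          "content": [
              {"type": "text", "text": "Describe this image."},
              {"type": "image_url",
               "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
          ],
      }],
      temperature=0.7,
  )
  print(response.choices[0].message.content)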