Comment by rhdunn
5 hours ago
If you don't want the thinking, you can pass `enable_thinking: false` to the `chat_template_kwargs`. If using promptfoo, this can be done via:
providers:
- # llama-server
id: openai:chat:qwen
config:
apiBaseUrl: http://localhost:7876
apiKey: "..."
passthrough:
chat_template_kwargs:
enable_thinking: false
The looping may be due to quantization -- I've seen it on locally quantized Q6_K Qwen 3.5/3.6 models. I recall seeing somewhere (here or r/LocalLlama) that Qwen models are sensitive to quantization of the keys, though I haven't yet experimented with/looked into fixing this. (I've been building up my promptfoo tests/infrastructure to detect looping, etc. on Qwen and other models.)
No comments yet
Contribute on Hacker News ↗