Comment by nl
16 hours ago
It can.
It's something that is implemented by the thing that runs the model - eg Llama.cpp - rather than the model itself.
Note that it is hard to make work if you turn thinking on because the grammar gets complicated quickly (I don't recall if Qwen 0.6B can do thinking).
Thinking shouldn't be too hard to deal with---just let the model generate freely until it hits a </think> token, then do constrained decoding, right?
Sure, but does llama-cpp support that?