Comment by aesthesia
14 hours ago
Thinking shouldn't be too hard to deal with: just let the model generate freely until it emits a </think> token, then switch to constrained decoding, right?
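A minimal sketch of that two-phase idea. It assumes greedy decoding and a toy token-scoring function. `generate`, `toy_model`, and the `allowed` callback are hypothetical names for illustration and are not llama.cpp's (or any library's) API. Before `</think>` every token is a candidate; after it, tokens the constraint rejects are masked out at each step.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces, not any real library's API:
# a "model" maps the tokens generated so far to a score per candidate token,
# and an "allowed" callback says whether a token may extend the constrained output.
ScoreFn = Callable[[List[str]], Dict[str, float]]
AllowFn = Callable[[List[str], str], bool]


def generate(
    model: ScoreFn,
    allowed: AllowFn,
    end_think: str = "</think>",
    eos: str = "<eos>",
    max_tokens: int = 64,
) -> List[str]:
    """Greedy decoding: unconstrained until `end_think`, constrained after it."""
    tokens: List[str] = []       # everything generated so far
    constrained: List[str] = []  # only the tokens emitted after `end_think`
    thinking = True
    for _ in range(max_tokens):
        scores = model(tokens)
        if not thinking:
            # Drop every token the constraint rejects (like setting its logit to -inf).
            scores = {t: s for t, s in scores.items() if allowed(constrained, t)}
            if not scores:
                raise ValueError("constraint rejects every candidate token")
        tok = max(scores, key=scores.get)  # greedy pick
        tokens.append(tok)
        if tok == eos:
            break
        if thinking:
            if tok == end_think:
                thinking = False  # switch to constrained decoding from the next step
        else:
            constrained.append(tok)
    return tokens


# Toy usage: the model likes to ramble and only closes the think block after 3 tokens.
def toy_model(tokens: List[str]) -> Dict[str, float]:
    scores = {"hmm": 2.0, "maybe": 1.5, "yes": 1.0, "no": 0.5, "<eos>": 0.1}
    scores["</think>"] = 3.0 if len(tokens) == 3 else -1.0
    return scores


def yes_no_then_eos(out: List[str], tok: str) -> bool:
    # Constrained answer must be exactly "yes" or "no", followed by <eos>.
    return tok in ("yes", "no") if not out else tok == "<eos>"


print(generate(toy_model, yes_no_then_eos))
# ['hmm', 'hmm', 'hmm', '</think>', 'yes', '<eos>']
```

Without the constraint the toy model would keep emitting "hmm" after `</think>`; with it, the answer is forced into the yes/no format while the reasoning part stays free. A real engine would apply a grammar as a per-step logit mask in the same place.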
14 hours ago
Sure, but does llama-cpp support that?