llama.cpp supports grammar-constrained output using either GBNF or a JSON schema (it just translates the schema to GBNF behind the scenes, I think). So I have my harness generate a tool schema on the fly (based on which tools are available for the current task) and pass it in at request time.
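A rough sketch of what that kind of on-the-fly schema generation could look like (not the actual harness). The tool names, their parameter schemas, and the helper functions are all made up for illustration. The `json_schema` request field is, as far as I know, what llama.cpp's server accepts on its completion endpoint, but check the server README before relying on it.

```python
import json

# Hypothetical tool registry: names and parameter schemas are illustrative only.
TOOLS = {
    "read_file": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
    "run_shell": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
    "search": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def build_tool_schema(allowed_tools):
    """Return a JSON Schema matching a call to exactly one of the allowed tools."""
    variants = []
    for name in allowed_tools:
        variants.append({
            "type": "object",
            "properties": {
                "tool": {"const": name},      # pins the tool name
                "arguments": TOOLS[name],     # that tool's own parameter schema
            },
            "required": ["tool", "arguments"],
        })
    return {"anyOf": variants}


def build_request(prompt, allowed_tools):
    """Build a request body; 'json_schema' is assumed to be the server's field name."""
    return {"prompt": prompt, "json_schema": build_tool_schema(allowed_tools)}


if __name__ == "__main__":
    # Only two of the three tools are offered for this (made-up) task.
    payload = build_request("List the files in /tmp.", ["read_file", "run_shell"])
    print(json.dumps(payload, indent=2))
```

The point of building the schema per request is that the model can't even emit a call to a tool that isn't on offer for the current task.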
It basically restricts which tokens are allowed at each sampling step (by masking their logits) so the output conforms to the JSON (or whatever) shape. It can also "confuse" the model, though, and doesn't always give you the output you want.
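A toy illustration of the masking idea, assuming a five-token vocabulary and invented logits. This is not how llama.cpp implements it internally, just the general technique: disallowed tokens get a logit of negative infinity, so they end up with zero probability.

```python
import math


def mask_logits(logits, allowed):
    """Keep logits for allowed token ids; set everything else to -inf."""
    if not allowed:
        raise ValueError("grammar must allow at least one token")
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Convert logits to probabilities; -inf logits become probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Return the id of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up vocabulary and logits for the first token of a reply.
VOCAB = ["Sure", "{", "}", '"tool"', "The"]
LOGITS = [3.0, 0.5, -1.0, 0.2, 2.5]

# Assumption: the grammar says a JSON object must start with "{" (token id 1).
ALLOWED = {1}

if __name__ == "__main__":
    print("unconstrained pick:", VOCAB[greedy(LOGITS)])
    print("constrained pick:", VOCAB[greedy(mask_logits(LOGITS, ALLOWED))])
    print("model's own probability for '{':", softmax(LOGITS)[1])
```

In this toy case the model wanted to start with prose, and the mask forces it onto a token it had rated as unlikely. That mismatch is one plausible way the "confusion" shows up.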
https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
Oh, interesting - thanks for the link. I really haven't explored this but it should slot in fairly easily I think? Gotta dig into it more.