Comment by dragonwriter
3 days ago
> In the modes in APIs, the sampling code essentially "rejects and reinference" any token sampled that wouldn't create valid JSON under a grammar created from the schema.
I thought the APIs in use generally interface with backend systems supporting logit manipulation, so there is no need to reject and reinference anything; its guaranteed right the first time because any token that would be invalid has a 0% chance of being produced.
I guess for the closed commercial systems that's speculative, but all the discussion of the internals of the open source systems I’ve seen has indicated that and I don't know why the closed systems would be less sophisticated.
I maintain a cross-platform llama.cpp client - you're right to point out that generally we expect nuking logits can take care of it.
There is a substantial performance cost to nuking, the open source internals discussion may have glossed over that for clarity (see github.com/llama.cpp/... below). The cost is very high, default in API* is not artificially lower other logits, and only do that if the first inference attempt yields a token invalid in the compiled grammar.
Similarly, I was hoping to be on target w/r/t to what strict mode is in an API, and am sort of describing the "outer loop" of sampling
* blissfully, you do not have to implement it manually anymore - it is a parameter in the sampling params member of the inference params
* "the grammar constraints applied on the full vocabulary can be very taxing. To improve performance, the grammar can be applied only to the sampled token..and nd only if the token doesn't fit the grammar, the grammar constraints are applied to the full vocabulary and the token is resampled." https://github.com/ggml-org/llama.cpp/blob/54a241f505d515d62...
This is a basic question but maybe you can help: what is a good resource to use to understand how to take advantage of logits?
https://dottxt-ai.github.io/outlines/latest/
For OpenAI, you can just pass in the json_schema to activate it, no library needed. For direct LLM interfacing you will need to host your own LLM or use a cloud provider that allows you too hook in, but someone else may need to correct me on this.
If anyone is using anything other than Outlines, please let us know.
1 reply →
Thanks for the explanation!