Comment by hansvm
22 days ago
This is good. It covers the two easiest of the dominant methods people use. It even touches on my main complaint about the one they seem to recommend.
That said:
- Constrained generation yields a different distribution from what a raw LLM would provide. This can be pathologically bad. My go-to example is LLMs having a preference for including ellipses in long, structured objects. Constrained generation forces closing quotes or whatever else it takes to recover from that error according to the schema, nevertheless yielding an invalid result: the data is truncated even though it fits the schema. Resampling tends to retry until the LLM fully generates the data in question, always yielding a valid result which also adheres to the schema. It can get much worse than that.
- The unconstrained "method" has a few possible implementations. Increasing context length by complaining about schema errors is almost always worse from an end-quality perspective than just retrying until the schema passes. Effective context windows are precious, and current models bias heavily toward the earlier data fed into them. In a low-error regime you might get away with a "try it again" turn in a single chat, but in a high-error regime you'll get better results at a lower cost by literally re-sending the same prompt until the model stops causing errors (see the sketch below).
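A minimal sketch of that second option, assuming a hypothetical `call_llm(prompt)` wrapper around whatever API you're using and the `jsonschema` package for validation:

```python
import json

import jsonschema  # pip install jsonschema


def generate_until_valid(call_llm, prompt, schema, max_attempts=5):
    """Re-send the identical prompt in a fresh context until the output
    parses as JSON and validates against the schema."""
    for _ in range(max_attempts):
        raw = call_llm(prompt)  # fresh call each time; no error transcript appended
        try:
            obj = json.loads(raw)
            jsonschema.validate(obj, schema)
            return obj
        except (json.JSONDecodeError, jsonschema.ValidationError):
            continue  # discard the failure and retry with the same short prompt
    raise RuntimeError(f"no schema-valid output after {max_attempts} attempts")
```

The point is that the prompt never grows: each attempt sees exactly the same short context instead of an ever-lengthening transcript of its own mistakes.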
> Increasing context length by complaining about schema errors is almost always worse from an end quality perspective than just retrying till the schema passes.
Another way to do this is to use a hybrid approach. You perform unconstrained generation first, and then constrained generation on the failures.
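Roughly like this (a sketch, reusing the hypothetical `call_llm` from the comment above plus a `call_llm_constrained(prompt, schema)` that does schema-constrained decoding):

```python
import json

import jsonschema


def generate_hybrid(call_llm, call_llm_constrained, prompt, schema):
    """One unconstrained attempt first; fall back to constrained decoding
    only if that attempt fails validation."""
    raw = call_llm(prompt)
    try:
        obj = json.loads(raw)
        jsonschema.validate(obj, schema)
        return obj  # keep the unconstrained output whenever it already passes
    except (json.JSONDecodeError, jsonschema.ValidationError):
        # The constrained fallback is guaranteed to satisfy the schema,
        # so each item costs at most two LLM calls.
        return json.loads(call_llm_constrained(prompt, schema))
```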
There's no difference in the output distribution between always doing constrained generation and only doing it on the failures though. What's the advantage?
There's no advantage wrt output quality, but it can be more economical in some high-error regimes, with fewer LLM calls than pure resampling would use (max 2 for most errors). For example, with a 50% per-attempt failure rate, plain resampling averages 2 calls with an unbounded tail, while the hybrid is capped at 2.
Regarding your first point, this makes me wonder if diffusion models will be the future of constrained decoding.
Perhaps. Would you mind elaborating on what you're envisioning?
In both cases (auto-regressive vs diffusive), you still have some process being followed, and the exact steps of that process matter to the result. If you constrain at each step then you get the equivalent of something like projected gradient descent (as an analogy) and aren't guaranteed the same solution. If you constrain as a post-processing phase then (a) diffusion wasn't required for the initial generation, and (b) it's still unlikely to converge to the same distribution (for similar reasons -- using my example of ellipsis errors, if you corrected that particular mistake in post then the closest valid messages to the initial generation are likely too short and thus still incorrect).