
Comment by HanClinto

1 day ago

This is a seriously beautiful guide. I really appreciate you putting this together! I especially love the tab-through animations on the various pages, and this is one of the best explanations that I've seen. I generally feel I understand grammar-constrained generation pretty well (I've merged a handful of contributions to the llama.cpp grammar implementation), and yet I still learned some insights from your illustrations -- thank you!

I'm also really glad that you're helping more people understand this feature, how it works, and how to use it effectively. I strongly believe that structured outputs are one of the most underrated features in LLM engines, and people should be using this feature more.

Constrained non-determinism means that we can reliably use LLMs as part of a larger pipeline or process (such as an agent with tool-calling) and we won't have failures due to syntax errors or erroneous "Sure! Here's your output formatted as JSON with no other text or preamble" messages thrown in.

Your LLM output might not be correct. But grammars ensure that your LLM output is at least _syntactically_ correct. It's not everything, but it's not nothing.
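To make that concrete, here's a minimal GBNF sketch (the grammar format llama.cpp uses; the field names are my own, purely for illustration) that pins the output to a tiny JSON object. The model can still put a wrong value in either field, but it cannot emit a preamble or break the braces:

    # Hypothetical sketch: force the output into {"spam": true|false, "reason": "..."}.
    # The values may still be wrong; the shape cannot be.
    root   ::= "{\"spam\":" ws bool ",\"reason\":" ws string "}"
    bool   ::= "true" | "false"
    string ::= "\"" [^"]* "\""   # simplified: no escaped quotes inside the string
    ws     ::= [ ]*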

And especially if we want to get away from cloud deployments and run effective local models, grammars are an incredibly valuable piece of this. For practical examples, I often think of Jart's example in her simple LLM-based spam-filter running on a Raspberry Pi [0]:

> llamafile -m TinyLlama-1.1B-Chat-v1.0.f16.gguf \
>   --grammar 'root ::= "yes" | "no"' --temp 0 -c 0 \
>   --no-display-prompt --log-disable -p "<|user|>
> Can you say for certain that the following email is spam? ...

Even on a super-tiny piece of hardware, by including a grammar that constrains the output to only ever be "yes" or "no" (the system cannot produce any other result), she can use a super-small model and it is still useful. It might not correctly identify spam, but it will never break for syntactic reasons, which gives a great boost to the usefulness of small, local models.

* [0]: https://justine.lol/matmul/
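As a rough illustration of the payoff downstream (this wrapper is my own sketch, not from her post; the prompt template is just the TinyLlama chat format): because the grammar guarantees the reply is literally "yes" or "no", the caller can branch on a plain string comparison, with no JSON parsing and no retry loop.

    #!/bin/sh
    # Hypothetical wrapper sketch around the command quoted above.
    # $1 is a file containing the email; the grammar guarantees $answer is exactly "yes" or "no".
    answer=$(llamafile -m TinyLlama-1.1B-Chat-v1.0.f16.gguf \
      --grammar 'root ::= "yes" | "no"' --temp 0 -c 0 \
      --no-display-prompt --log-disable \
      -p "<|user|>
    Can you say for certain that the following email is spam?
    $(cat "$1")</s>
    <|assistant|>")

    case "$answer" in
      yes) echo "spam" ;;
      no)  echo "not spam" ;;
    esac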

> strongly believe that structured outputs are one of the most underrated features in LLM engines

Structured output is really the whole foundation of lots of our hopes and dreams. The JSONSchemaBench paper is fairly preoccupied with performance, but where it talks about quality/compliance, the "LM only" scores in the tables are pretty bad. This post highlights the ongoing difficulty and confusion around doing a simple, necessary, and very routine task well.

Massaging small inputs into structured formats isn't really the point. It's about all the nontrivial cases central to MCP, tool use, and local or custom APIs. My favorite example of this is every tool-use tutorial that pretends "ping" accepts 2 arguments, when it actually takes more like 20, with subtle gotchas. Do the tool-use demos that correctly work with 2 arguments actually work with 20? How many more retries might that take, and what does that change about the hardware and models we need for "basic" stuff?
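For a sense of scale, here is a hypothetical fragment of what a more honest schema for ping starts to look like once you move past the demo-friendly two arguments (flag semantics are roughly those of the common Linux iputils ping; the schema itself is just a sketch):

    {
      "type": "object",
      "properties": {
        "destination": { "type": "string" },
        "count":       { "type": "integer", "minimum": 1 },
        "interval":    { "type": "number", "description": "seconds between packets; very small values typically need elevated privileges" },
        "ttl":         { "type": "integer", "minimum": 1, "maximum": 255 },
        "timeout":     { "type": "number", "description": "per-reply wait, not the overall deadline" },
        "packet_size": { "type": "integer", "minimum": 0 }
      },
      "required": ["destination"]
    }

And that's still only a handful of the real options.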

If you had a JSON schema correctly and completely describing legal input for, say, ffmpeg, its size and complexity would approach that of the Kubernetes schemas (where JSONSchemaBench compliance is only at 0.56). Can you maybe YOLO-generate a correct ffmpeg command without consulting any schema with SOTA models? Of course, but that works well because ffmpeg is a well-documented tool with decades of examples floating around in the wild. What's the arg count and type complexity for that one important function/class in your in-house code base? For a less well-known use case or tool, if you want hallucination-free and correct output, then you need structured output that works, because the alternative is rolling your own model trained on your stuff.

What does it do when the model wants to return something else, and what's better/worse about doing it in llamafile vs. whatever wrapper is calling it? How do I set retries? What if I want JSON and a range instead?

  • You can't do it as part of whatever's calling it, because this changes the sampler. The grammar constrains which tokens the sampler is allowed to consider, only passing tokens that are valid under the grammar.

  • > What does it do when the model wants to return something else,

    You can build that into your structure, the same way you would allow error values to be returned from a system (see the grammar sketch after this list).

  • There are no retries. The grammar is enforced on the output tokens inside llama.cpp itself, so every accepted token is already valid.
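For example (a sketch of the idea, not code from the guide or llamafile): if you want the model to be able to say "something else", put that branch into the grammar itself, so there's an escape hatch without ever allowing free-form text:

    # Hypothetical sketch: an explicit escape hatch instead of a retry loop.
    # The model can only ever emit "yes", "no", or an "unsure" object -- never free-form text.
    root   ::= "yes" | "no" | unsure
    unsure ::= "{\"unsure\":\"" [^"]* "\"}"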