Comment by libraryofbabel

21 hours ago

Question for the well-informed people reading this thread: do SoTA models like Opus, Gemini and friends actually still need output schema enforcement, or has all the RLVR training they do on generating code, JSON, etc. made schema errors vanishingly unlikely? As a user of those models, they almost never make syntax mistakes when generating JSON and code; perhaps they still do output schema enforcement for "internal" things like tool call schemas, though? I would just be surprised if it was actually catching that many errors. Maybe once in a while; LLMs are probabilistic, after all.

(I get why you need structured generation for smaller LLMs, that makes sense.)

Schemas can get pretty complex (and LLMs might not be the best at counting). Also, schemas are sometimes the first line of defense against the stochasticity of LLMs.

With that said, the model is pretty good at it.

This is going to be task-dependent, as well as limited by your (the implementer's) ability and comfort with structuring the task in solid multi-shot prompts that cover a large distribution of expected inputs. That also helps the model handle less common or edge-case inputs, the ones that would most typically require human-level reasoning. It can be useful to supplement this with a "tool" for RAG lookup against a more extensive store of examples, or any time the full reference material isn't practical to dump into context. This requires thoughtful chunking.
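For concreteness, here's a minimal sketch of that multi-shot-plus-RAG-lookup shape. The chat-message layout is generic, and `STATIC_EXAMPLES` / `retrieve_examples` are hypothetical stand-ins for your curated example pairs and your chunked example store, not any particular SDK:

```python
# Hypothetical few-shot + retrieved-example prompt builder (plain chat-message
# dicts; swap in whatever client/SDK you actually use).

STATIC_EXAMPLES = [
    {"input": "example covering a common case", "output": '{"field": "value"}'},
    # ...a handful more, chosen to cover the expected input distribution
]

def retrieve_examples(task_input: str, k: int = 3) -> list[dict]:
    """Hypothetical RAG lookup over a chunked store of worked examples."""
    return []  # placeholder: e.g. vector search over pre-chunked reference material

def build_messages(task_input: str) -> list[dict]:
    messages = [{"role": "system", "content": "You extract the required fields as JSON."}]
    # Static few-shot pairs for the common cases.
    for ex in STATIC_EXAMPLES:
        messages += [
            {"role": "user", "content": ex["input"]},
            {"role": "assistant", "content": ex["output"]},
        ]
    # Retrieved examples for rarer / edge-case inputs.
    for ex in retrieve_examples(task_input):
        messages += [
            {"role": "user", "content": ex["input"]},
            {"role": "assistant", "content": ex["output"]},
        ]
    messages.append({"role": "user", "content": task_input})
    return messages
```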

It also requires testing. Don't think of it as a magic machine that should be able to do anything; think of it as a new employee who is smart enough, and has enough background knowledge, to do the task if given proper job documentation. Test whether few-shot or many-shot prompting works better: there's growing evidence about which use cases favor one or the other, but so much of this is task-dependent.
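That comparison doesn't need heavy tooling. A sketch, where `call_model` is a placeholder for whatever client call you use and the test cases are a labeled sample of real inputs:

```python
TEST_CASES = [
    {"input": "representative input 1", "expected": "expected output 1"},
    # ...more labeled cases, including the edge cases you care about
]

def call_model(messages: list[dict]) -> str:
    raise NotImplementedError("placeholder: whatever client call you use")

def accuracy(build_messages, cases) -> float:
    hits = sum(
        call_model(build_messages(c["input"])).strip() == c["expected"]
        for c in cases
    )
    return hits / len(cases)

# e.g. accuracy(build_messages_few_shot, TEST_CASES)
#  vs. accuracy(build_messages_many_shot, TEST_CASES)
```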

Consider your tolerance for errors and plan some escalation method. Hallucinations occur in part because models "have to" give an answer, so make sure that any critical case where an error would be problematic has some way for the model to bail out with "I don't know" for human review. The first layer of escalation doesn't even have to be a human; it could be a separate model, e.g. Opus instead of Sonnet, or the same model with a different system prompt explicitly designed for handling certain cases without cluttering up the context of the first one. Splitting things this way, if there's a logical break point, is also a great way to save on token cost: if you can send a 10k-token system prompt instead of 50k, and just choose which of five 10k prompts to use for different cases, you save 80% of upstream token $$.
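A rough sketch of that routing-plus-escalation shape. The case names, prompts, sentinel string, and model-call helpers are all made up for illustration; nothing here is tied to a specific SDK:

```python
# Hypothetical router: one small system prompt per case type, with an
# explicit escalation path instead of forcing an answer.

CASE_PROMPTS = {
    "invoices": "...~10k-token system prompt for invoice handling...",
    "contracts": "...~10k-token system prompt for contract handling...",
    "general": "...general-purpose fallback prompt...",
}

def classify_case(task_input: str) -> str:
    return "general"  # placeholder: cheap heuristic or a tiny classifier call

def call_small_model(system_prompt: str, task_input: str) -> str:
    raise NotImplementedError  # e.g. Sonnet or a flash-tier model

def call_big_model(task_input: str) -> str:
    raise NotImplementedError  # e.g. Opus, or a dedicated edge-case prompt

def run(task_input: str) -> str:
    # Send one ~10k prompt instead of a 50k catch-all prompt.
    system_prompt = CASE_PROMPTS[classify_case(task_input)]
    answer = call_small_model(system_prompt, task_input)
    # The system prompt tells the model to answer "UNSURE" when it can't
    # answer confidently, rather than guessing.
    if answer.strip().upper() == "UNSURE":
        return call_big_model(task_input)  # escalate; last resort is a human
    return answer
```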

Consider running the model deterministically: temperature 0, same seed. It makes any errors you encounter easier to trace and debug.
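With an OpenAI-style client that looks roughly like the snippet below; note that not every provider exposes a seed, and where one exists (e.g. OpenAI's `seed` parameter) determinism is best-effort rather than guaranteed:

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",      # whichever model you're debugging
    messages=[{"role": "user", "content": "..."}],
    temperature=0,            # greedy-ish decoding
    seed=1234,                # pin sampling where supported
)
print(resp.choices[0].message.content)
```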

Something to consider with respect to cost, though: many tasks that a SoTA model can do with very little or no scaffolding can be done by these cheaper models without much more scaffolding. If a SoTA model is giving reliable responses with zero-shot prompting, there's a decent chance you can save a ton of money with a flash-tier model by giving it one- or few-shot prompts. Open-weight models even more so.

My anecdotal experience is that open models like Google's Gemma and OpenAI's gpt-oss behave more like their paid counterparts than other open models do, making them reasonable candidates to try if you're getting good results from the paid models but suspect they're overkill for the task.

Yes. The most common failure mode for SoTA models is to prepend ```json\n, but they still fail just often enough that it's worth calling the API with a JSON response schema (see the sketch below).

  • 1000%. I was just doing some spot-checking of GPT-5.2 to evaluate a model migration, and the tool I used didn't have the setup for schema-constrained inference.

    The model is like: "Here is what I came up with... ```{json}``` and this is why I am proud of it!"
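To make the fence-wrapping issue above concrete, here's a rough sketch of both mitigations: asking the API to enforce a response schema where that's supported (OpenAI-style structured outputs shown; the schema itself is made up), plus defensively stripping a Markdown code fence before parsing as a fallback for clients that can't constrain the output:

```python
import json
import re

from openai import OpenAI

# Illustrative schema; OpenAI's strict mode wants every property listed as
# required and additionalProperties set to false.
SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "score": {"type": "number"}},
    "required": ["name", "score"],
    "additionalProperties": False,
}

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract name and score from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "extraction", "schema": SCHEMA, "strict": True},
    },
)

def parse_loose_json(text: str) -> dict:
    """Fallback parser that tolerates a Markdown code fence around the JSON."""
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    return json.loads(fenced.group(1) if fenced else text)

print(parse_loose_json(resp.choices[0].message.content))
```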