Comment by vitaelabitur
3 hours ago
I tokenized these and they seem to use around 20% fewer tokens than the original JSON. That makes me think a schema like this could cut latency and cost in constrained LLM decoding.
I know LLMs are very familiar with JSON, and that picking an uncommon schema just to save tokens can hurt semantic performance. But a schema that's sufficiently JSON-like probably won't disrupt the model's learned patterns much, and should avoid that kind of unintended bias.
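For anyone who wants to reproduce the comparison, something like tiktoken makes it a one-liner per variant. A quick sketch (the sample record and the model choice are just placeholders, not what I actually measured):

```python
import json
import tiktoken

# Encoding used by recent OpenAI models; swap in whatever model you target.
enc = tiktoken.encoding_for_model("gpt-4o")

record = {"user": {"id": 42, "name": "Ada", "tags": ["alpha", "beta"]}}

pretty = json.dumps(record, indent=2)                  # pretty-printed
minified = json.dumps(record, separators=(",", ":"))   # no whitespace

for label, text in [("pretty", pretty), ("minified", minified)]:
    print(label, len(enc.encode(text)))
```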
Minified JSON would use even fewer tokens.
Yeah, but I tried switching to minified JSON on a semantic labelling task and saw a ~5% accuracy drop.
I suspect that's because most JSON in the pre-training corpus is pretty-printed, so the model was forced off its most likely token paths and also lost the "visual cues" of nesting depth that indentation provides.
This might happen here too, but maybe to a lesser extent. Anyways, I'll stop building castles in the air now and try it sometime.
If you really care about structured output, switch to XML. Much better results, which is why AI providers tend to use pseudo-XML in their system prompts and tool definitions.
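By pseudo-XML I mean lightweight tags used purely as delimiters, not schema-valid XML. A made-up example of the style:

```
<instructions>
Label each <document> as positive, negative, or neutral.
</instructions>
<document id="1">Great service, would buy again.</document>
```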