Comment by vitaelabitur
3 hours ago
I tokenized these and they seem to use around 20% fewer tokens than the original JSON. That makes me think a schema like this could cut latency and cost in constrained LLM decoding.
I know LLMs are very familiar with JSON, and that picking an uncommon schema just to save tokens can hurt semantic performance. But a schema that's sufficiently JSON-like probably won't disrupt the model's learned patterns much, and should avoid that kind of unintended bias.
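For anyone who wants to reproduce the comparison, something like tiktoken makes it a one-liner per variant. A quick sketch (the sample record and the model choice are just placeholders, not what I actually measured):

```python
import json
import tiktoken

# Encoding used by recent OpenAI models; swap in whatever model you target.
enc = tiktoken.encoding_for_model("gpt-4o")

record = {"user": {"id": 42, "name": "Ada", "tags": ["alpha", "beta"]}}

pretty = json.dumps(record, indent=2)                  # pretty-printed
minified = json.dumps(record, separators=(",", ":"))   # no whitespace

for label, text in [("pretty", pretty), ("minified", minified)]:
    print(label, len(enc.encode(text)))
```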
Minified JSON would use even fewer tokens.
Yeah, but I tried switching to minified JSON on a semantic labelling task and saw a ~5% accuracy drop.
I suspect that's because most JSON in the pre-training corpus is pretty-printed, so the model was forced off its most likely token paths and also lost the "visual cues" of nesting depth that indentation provides.
This might happen here too, but maybe to a lesser extent. Anyways, I'll stop building castles in the air now and try it sometime.
If you really care about structured output, switch to XML. Much better results, which is why AI providers tend to use pseudo-XML in their system prompts and tool definitions.
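By pseudo-XML I mean lightweight tags used purely as delimiters, not schema-valid XML. A made-up example of the style:

```
<instructions>
Label each <document> as positive, negative, or neutral.
</instructions>
<document id="1">Great service, would buy again.</document>
```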