Comment by shmolyneaux
1 day ago
Are there output formats that are more reliable (better adherence to the schema, easier to get parse-able output) or cheaper (fewer tokens) than JSON? YAML has its own problems and TOML isn't widely adopted, but they both seem like they would be easier to generate.
What have folks tried?
Yes, that's the purpose of TOON.
https://github.com/toon-format/toon
Is there evidence that LLMs adhere to this format better than to JSON? I doubt that.
Their benchmarks compare it against other formats as input, not as output.
It is 100% guaranteed that they DON'T. Toon is 3 months old, it's not used by anyone, and it's therefore not in the training set of any model.
Nice. It would be a good idea to develop a CFG for this as well, so it can be embedded into all these constrained-decoding libraries.
Just brainstorming. Human beings have trouble writing JSON because it is too annoying, too strict. In my experience, for humans, writing TypeScript is a lot better than writing JSON directly, even when the file is just a JSON object: it allows comments, and it allows things like trailing commas, which are better for readability.
So maybe an interesting file to have the LLM generate is not the final file but a program that creates the final file? There is the problem of security, of course: the program the LLM generates would need to be sandboxed properly and time-constrained to prevent DoS attacks or explosive output sizes, not to mention the CPU usage of the final result. But quality-wise, would it be better?
Generating code that, when run, generates the JSON works well if you design the builder functions thoughtfully. It takes fewer tokens, too.
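Something like this, as a rough TypeScript sketch (addCell and the Cell shape are made-up names for illustration): the model only emits builder calls, and the host does the serialization, so quoting, escaping, and commas can never be wrong.

    // Hypothetical builder API: the LLM writes only calls to addCell(),
    // and the host turns the accumulated cells into JSON at the end.
    type Cell = { inputs: string[]; code: string };

    const cells: Cell[] = [];

    function addCell(code: string, inputs: string[] = []): void {
      // Builders can validate as they go, instead of after parsing a finished blob.
      if (code.trim().length === 0) throw new Error("empty cell");
      cells.push({ inputs, code });
    }

    // The model's output is just a sequence of calls:
    addCell("const x = 1;");
    addCell("x + inputA", ["inputA"]);

    // The host, not the model, produces the final JSON.
    console.log(JSON.stringify({ cells }, null, 2));

The point is that escaping and structure live in the builders, so the model only has to get function calls right.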
We're working on an agentic content transformation pipeline based on Markdown with YAML metadata in the front matter. I'm a bit worried about the lack of tooling compared to what exists for JSON payloads, but then again the front matter isn't that hard to parse and convert to JSON to validate against a schema.
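A rough sketch of that step, assuming js-yaml and Ajv (just example libraries, not a recommendation): split out the front matter, load it into a plain object, and validate that object against a JSON Schema.

    import { load } from "js-yaml";
    import Ajv from "ajv";

    // Example schema; the real one would describe your pipeline's metadata.
    const schema = {
      type: "object",
      required: ["title", "tags"],
      properties: {
        title: { type: "string" },
        tags: { type: "array", items: { type: "string" } },
      },
    };

    function validateFrontMatter(markdown: string): unknown {
      // Front matter = the YAML between the leading "---" fences.
      const match = markdown.match(/^---\n([\s\S]*?)\n---/);
      if (!match) throw new Error("no front matter block");
      const data = load(match[1]); // YAML -> plain JS object, i.e. "JSON"
      const validate = new Ajv().compile(schema);
      if (!validate(data)) throw new Error(JSON.stringify(validate.errors));
      return data;
    }

So the missing YAML-native schema tooling matters less than it first seems; the schema check happens after the cheap YAML-to-object conversion.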
I use a regex to force an XML schema and then use a normal XML parser to decode.
XML is better for code, and for the code parts in particular I enforce a <![CDATA[ ... ]]> section, so the LLM is pretty free to do anything without escaping.
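Roughly the shape of it, as an illustrative TypeScript snippet (assumes a browser-style DOMParser): the model is constrained to put code inside a CDATA section, so angle brackets, quotes, and ampersands need no escaping, and a normal XML parser recovers the code verbatim.

    const response = `
    <cell>
      <code><![CDATA[
        if (a < b && c > d) { console.log("no escaping needed"); }
      ]]></code>
    </cell>`;

    const doc = new DOMParser().parseFromString(
      "<response>" + response + "</response>",
      "text/xml"
    );
    // textContent of the element includes the CDATA body as-is.
    const code = doc.querySelector("code")?.textContent?.trim() ?? "";
    console.log(code);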
The OpenAI API lets you do regex-constrained structured output, and it's much better than JSON for code.
Could you share some samples / pointers on how you do this?
Yeah, this upsert_cell tool does it
https://observablehq.com/@tomlarkworthy/forking-agent#upsert...
format: { type: "grammar", syntax: "regex", definition: cellsRegex },
Where cellsRegex is (excerpted; the full definition is in the linked notebook):

    cellsRegex = {
      const CELL_OPEN = String.raw`<cell>\s`;
      // ...
    }
And the extraction logic is here https://observablehq.com/@tomlarkworthy/robocoop-2#process
    function process(content) {
      const doc = domParser.parseFromString(
        "<response>" + content + "</response>",
        "text/xml"
      );
      const cells = [...doc.querySelectorAll("cell")];
      return cells.map((cell) => {
        const inputsContent = cell.querySelector("inputs")?.textContent || "";
        return {
          inputs: inputsContent.length > 0
            ? inputsContent.split(",").map((s) => s.trim())
            : [],
          code: (cell.querySelector("code")?.textContent || "").trim()
        };
      });
    }
BTW, that agent is under development and not actually that good at programming. Its parent, https://observablehq.com/@tomlarkworthy/robocoop-2, is very good at notebook programming.
You should do your own evals, specific to your use case. In my evals, XML outperforms JSON on every model for out-of-distribution tasks (i.e. not for JSON shapes that were already in the training data).
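A bare-bones harness for that kind of eval (runModel and parseAs are placeholders for your own model call and format-specific parsing/validation): run the same tasks once per output format and compare how often the result parses and passes your checks.

    type Format = "json" | "xml";

    async function successRate(
      format: Format,
      tasks: string[],
      runModel: (prompt: string, format: Format) => Promise<string>,
      parseAs: (raw: string, format: Format) => unknown // should throw on failure
    ): Promise<number> {
      let ok = 0;
      for (const task of tasks) {
        try {
          parseAs(await runModel(task, format), format);
          ok++;
        } catch {
          // a parse/validation failure counts against this format
        }
      }
      return ok / tasks.length;
    }

Run it over tasks that look like your real workload rather than toy prompts, since the JSON vs XML gap seems to depend on how in-distribution the task is.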