Comment by shmolyneaux

1 day ago

Are there output formats that are more reliable (better adherence to the schema, easier to get parse-able output) or cheaper (fewer tokens) than JSON? YAML has its own problems and TOML isn't widely adopted, but they both seem like they would be easier to generate.

What have folks tried?

Yes, that's the purpose of TOON.

https://github.com/toon-format/toon

  • Nice, it would be a good idea to develop a CFG for this as well, so it can be embedded into all these constrained decoding libraries

Just brainstorming. Humans have trouble writing JSON because it is too annoying and too strict. In my experience, writing TypeScript is a lot better for humans than writing JSON directly, even when the file is just a JSON object: it allows comments, and it allows things like trailing commas, which are better for readability.

So maybe an interesting file for the LLM to generate, instead of the final file, is a program that creates the final file. There is the problem of security, of course: the generated program would need to be properly sandboxed and time-constrained to prevent DoS attacks and explosive output sizes, not to mention the CPU usage of the final result. But quality-wise, would it be better?

Generating code that, when run, generates JSON works well if you design the builder functions thoughtfully. It takes fewer tokens too.
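To make the idea concrete, here is a minimal sketch of what such builder functions could look like. The helpers (`obj`, `cell`) and the field names are hypothetical, not the commenter's actual API; the point is that the model emits terse function calls and the runtime guarantees well-formed JSON:

```javascript
// Hypothetical builder helpers -- one possible shape for such an API.
// Dropping undefined values means the model can omit optional fields
// without emitting nulls or worrying about trailing commas.
function obj(entries) {
  return Object.fromEntries(entries.filter(([, v]) => v !== undefined));
}

function cell(name, code, inputs = []) {
  return obj([["name", name], ["inputs", inputs], ["code", code]]);
}

// The model generates calls like these instead of raw JSON...
const doc = [
  cell("data", "fetchRows()"),
  cell("plot", "Plot.dot(data).plot()", ["data"])
];

// ...and running the program produces the final, always-parseable document.
const json = JSON.stringify(doc, null, 2);
```

No quoted keys, no brace balancing, no escaping in the model's output, and a syntax error fails loudly at run time instead of producing silently truncated JSON.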

We're working on an agentic content transformation pipeline based on Markdown with YAML metadata in the front matter. I'm a bit worried about the lack of tooling compared to JSON payloads, but then again it's not that hard to parse the front matter, convert it to JSON, and validate it against a schema.
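That parse-then-validate step can be sketched roughly like this. This is a deliberately naive splitter that only handles flat `key: value` lines; a real pipeline would hand the front-matter block to a proper YAML parser (e.g. js-yaml) before running schema validation:

```javascript
// Naive front-matter splitter: extracts the block between the leading
// `---` fences and treats each line as a flat `key: value` pair.
// Illustrative only -- nested YAML needs a real parser.
function parseFrontMatter(text) {
  const m = text.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!m) return { meta: {}, body: text };
  const meta = {};
  for (const line of m[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: m[2] };
}

const { meta, body } = parseFrontMatter(
  "---\ntitle: Hello\ndraft: true\n---\n# Heading\n"
);
// `meta` is now a plain object, so it can be serialized with
// JSON.stringify and validated against a JSON Schema like any payload.
```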

I use regex to force an XML schema and then use a normal XML parser to decode.

XML is better for code, and for the code parts in particular I enforce a `<![CDATA[` section, so the LLM is pretty free to emit anything without escaping.
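For illustration, a payload in that style might look like the following (the tag names are just an example, not necessarily the commenter's schema). Everything inside the CDATA section is passed through verbatim, so the model never has to escape quotes, `<`, or `&` in the code it writes:

```xml
<cell>
  <inputs>data, width</inputs>
  <code><![CDATA[
    // Unescaped code is fine here: quotes, < and & all pass through.
    const chart = Plot.dot(data, { r: width < 600 ? 2 : 4 }).plot();
  ]]></code>
</cell>
```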

OpenAI API lets you do regex structured output and it's much better than JSON for code.

  • Could you share some samples / pointers on how you do this?

    • Yeah, this upsert_cell tool does it

      https://observablehq.com/@tomlarkworthy/forking-agent#upsert...

      format: { type: "grammar", syntax: "regex", definition: cellsRegex },

      Where cellsRegex is

      cellsRegex = {
        const CELL_OPEN = String.raw`<cell>\s`;
        const INPUTS_BLOCK = String.raw`<inputs>.*<\/inputs>\s*`;
        const CODE_BLOCK = String.raw`<code><!\[CDATA\[[\s\S]*\]\]>\s*<\/code>\s*`;
        const CELL_CLOSE = String.raw`<\/cell>`;
        return "^(" + CELL_OPEN + INPUTS_BLOCK + CODE_BLOCK + CELL_CLOSE + ")*$";
      }

      And the extraction logic is here https://observablehq.com/@tomlarkworthy/robocoop-2#process

      function process(content) {
        const doc = domParser.parseFromString(
          "<response>" + content + "</response>",
          "text/xml"
        );
        const cells = [...doc.querySelectorAll("cell")];
        return cells.map((cell) => {
          const inputsContent = cell.querySelector("inputs")?.textContent || "";
          return {
            inputs: inputsContent.length > 0
              ? inputsContent.split(",").map((s) => s.trim())
              : [],
            code: (cell.querySelector("code")?.textContent || "").trim()
          };
        });
      }

      BTW, that agent is under development and not actually that good at programming. Its parent, https://observablehq.com/@tomlarkworthy/robocoop-2, is actually very good at notebook programming.

You should do your own evals specific to your use case. In my evals, XML outperforms JSON on every model for out-of-distribution tasks (i.e., not for JSON shapes that were already in the training data).