Comment by shmolyneaux

1 day ago

Are there output formats that are more reliable (better adherence to the schema, easier to get parse-able output) or cheaper (fewer tokens) than JSON? YAML has its own problems and TOML isn't widely adopted, but they both seem like they would be easier to generate.

What have folks tried?

Yes, that's the purpose of TOON.

https://github.com/toon-format/toon

  • Nice, it would be a good idea to develop a CFG for this as well, so it can be embedded into all these constrained decoding libraries

Just brainstorming. Humans have trouble writing JSON because it is too annoying and too strict. In my experience, writing TypeScript is a lot better for humans than writing JSON directly, even when the file is just a JSON object: it allows comments, and it allows things like trailing commas, which are better for readability.

So maybe an interesting file for the LLM to generate, instead of the final file, is a program that creates the final file. There is the problem of security, of course: the generated program would need to be properly sandboxed and time-constrained to prevent DoS attacks and explosive output sizes, not to mention the CPU usage of the final result. But quality-wise, would it be better?

Generating code that, when run, generates JSON works well if you design the builder functions thoughtfully. It takes fewer tokens too.
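To make the idea concrete, here is a minimal sketch of what such builder functions could look like. The helpers (`obj`, `cell`) and the field names are hypothetical, not the commenter's actual API; the point is that the model emits terse function calls and the runtime guarantees well-formed JSON:

```javascript
// Hypothetical builder helpers -- one possible shape for such an API.
// Dropping undefined values means the model can omit optional fields
// without emitting nulls or worrying about trailing commas.
function obj(entries) {
  return Object.fromEntries(entries.filter(([, v]) => v !== undefined));
}

function cell(name, code, inputs = []) {
  return obj([["name", name], ["inputs", inputs], ["code", code]]);
}

// The model generates calls like these instead of raw JSON...
const doc = [
  cell("data", "fetchRows()"),
  cell("plot", "Plot.dot(data).plot()", ["data"])
];

// ...and running the program produces the final, always-parseable document.
const json = JSON.stringify(doc, null, 2);
```

No quoted keys, no brace balancing, no escaping in the model's output, and a syntax error fails loudly at run time instead of producing silently truncated JSON.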

We're working on an agentic content transformation pipeline based on Markdown with YAML metadata in the front matter. I'm a bit worried about the lack of tooling compared to JSON payloads, but then again it's not that hard to parse the front matter, convert it to JSON, and validate it against a schema.
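That parse-then-validate step can be sketched roughly like this. This is a deliberately naive splitter that only handles flat `key: value` lines; a real pipeline would hand the front-matter block to a proper YAML parser (e.g. js-yaml) before running schema validation:

```javascript
// Naive front-matter splitter: extracts the block between the leading
// `---` fences and treats each line as a flat `key: value` pair.
// Illustrative only -- nested YAML needs a real parser.
function parseFrontMatter(text) {
  const m = text.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!m) return { meta: {}, body: text };
  const meta = {};
  for (const line of m[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: m[2] };
}

const { meta, body } = parseFrontMatter(
  "---\ntitle: Hello\ndraft: true\n---\n# Heading\n"
);
// `meta` is now a plain object, so it can be serialized with
// JSON.stringify and validated against a JSON Schema like any payload.
```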

I use regex to force an XML schema and then use a normal XML parser to decode.

XML is better for code, and for the code parts in particular I enforce a `<![CDATA[` section, so the LLM is pretty free to emit anything without escaping.
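For illustration, a payload in that style might look like the following (the tag names are just an example, not necessarily the commenter's schema). Everything inside the CDATA section is passed through verbatim, so the model never has to escape quotes, `<`, or `&` in the code it writes:

```xml
<cell>
  <inputs>data, width</inputs>
  <code><![CDATA[
    // Unescaped code is fine here: quotes, < and & all pass through.
    const chart = Plot.dot(data, { r: width < 600 ? 2 : 4 }).plot();
  ]]></code>
</cell>
```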

OpenAI API lets you do regex structured output and it's much better than JSON for code.

  • Could you share some samples / pointers on how you do this?

    • Yeah, this upsert_cell tool does it

      https://observablehq.com/@tomlarkworthy/forking-agent#upsert...

      format: { type: "grammar", syntax: "regex", definition: cellsRegex },

      Where cellsRegex is

      cellsRegex = {
        const CELL_OPEN = String.raw`<cell>\s`;
        const INPUTS_BLOCK = String.raw`<inputs>.*<\/inputs>\s*`;
        const CODE_BLOCK = String.raw`<code><!\[CDATA\[[\s\S]*\]\]>\s*<\/code>\s*`;
        const CELL_CLOSE = String.raw`<\/cell>`;
        return "^(" + CELL_OPEN + INPUTS_BLOCK + CODE_BLOCK + CELL_CLOSE + ")*$";
      }

      And the extraction logic is here https://observablehq.com/@tomlarkworthy/robocoop-2#process

      function process(content) {
        const doc = domParser.parseFromString(
          "<response>" + content + "</response>",
          "text/xml"
        );
        const cells = [...doc.querySelectorAll("cell")];
        return cells.map((cell) => {
          const inputsContent = cell.querySelector("inputs")?.textContent || "";
          return {
            inputs: inputsContent.length > 0
              ? inputsContent.split(",").map((s) => s.trim())
              : [],
            code: (cell.querySelector("code")?.textContent || "").trim()
          };
        });
      }

      BTW, that agent is under development and not actually that good at programming. Its parent, https://observablehq.com/@tomlarkworthy/robocoop-2, is actually very good at notebook programming.

You should do your own evals specific to your use case. In my evals, XML outperforms JSON on every model for out-of-distribution tasks (i.e., not for JSON shapes that were already in the training data).