Comment by noosphr
5 days ago
The AI company I've been working at ran out of money last week, so I'm taking a month-long break.
I've been playing around with defining an easy-to-implement standard for serializing tabular data using the ASCII delimiter characters.
So far I've got:
  <group>      ::= GS | <record>
  <record>     ::= RS <group> | <unit>
  <unit>       ::= <high-ascii> | US <record>
  <high-ascii> ::= 0x20 <unit> | ... | 0x7E <unit>
This seems like a good way to avoid all the trouble of escaping separators in CSV files, if a bit clunky, since each record has to end with US RS and the file has to end with US RS GS.
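Concretely, here's a rough Python sketch of what a writer/reader for that framing could look like (the dumps/loads names and the printable-ASCII check are just how I'd illustrate it, not part of any spec):

  # US (0x1F) ends every unit, RS (0x1E) ends every record,
  # GS (0x1D) ends the file, matching the grammar above.
  US, RS, GS = "\x1f", "\x1e", "\x1d"

  def dumps(rows):
      # rows is a list of records, each a list of printable-ASCII strings
      out = []
      for row in rows:
          for cell in row:
              if not all(0x20 <= ord(c) <= 0x7e for c in cell):
                  raise ValueError("cells must be printable ASCII (0x20-0x7E)")
              out.append(cell + US)
          out.append(RS)
      out.append(GS)
      return "".join(out)

  def loads(data):
      # inverse of dumps: strip the trailing GS, then split on RS and US
      if not data.endswith(GS):
          raise ValueError("missing trailing GS")
      return [rec.split(US)[:-1] for rec in data[:-1].split(RS)[:-1]]

  assert loads(dumps([["ab", "cd"], ["ef"]])) == [["ab", "cd"], ["ef"]]

Since the separators can never appear inside a unit, there's no quoting or escaping step at all, which is the whole point.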
I also accidentally found another test that _all_ LLMs fail at (including all the reasoning models): deciding whether a given string is derivable from a grammar. I was asking for test cases before I started coding, and _every_ frontier model gave me obvious garbage. I've not seen such bad performance on such low-hanging fruit for automated training in over a year.
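For reference, this is the kind of recognizer I wanted tests for: a recursive-descent check written straight from the productions above, returning True iff the string is derivable from <group>. It's my own sketch (not any model's answer), and it recurses once per input character, so it's only meant for short test strings:

  US, RS, GS = "\x1f", "\x1e", "\x1d"

  def derivable(s):
      # each function consumes its nonterminal starting at index i and
      # returns the index just past it, or None if it can't be derived
      def group(i):
          if i < len(s) and s[i] == GS:      # <group> ::= GS
              return i + 1
          return record(i)                   # <group> ::= <record>

      def record(i):
          if i < len(s) and s[i] == RS:      # <record> ::= RS <group>
              return group(i + 1)
          return unit(i)                     # <record> ::= <unit>

      def unit(i):
          if i < len(s) and s[i] == US:      # <unit> ::= US <record>
              return record(i + 1)
          if i < len(s) and 0x20 <= ord(s[i]) <= 0x7e:
              return unit(i + 1)             # <high-ascii> ::= 0x20..0x7E <unit>
          return None

      return group(0) == len(s)

  assert derivable("\x1d")                            # a bare GS is a valid empty file
  assert derivable("ab\x1fcd\x1f\x1eef\x1f\x1e\x1d")  # two records, then GS
  assert not derivable("ab,cd")                       # plain CSV text is not derivable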
Hey, good to see someone using ASCII
Don't forget File Separator, 0x1C.