Comment by Der_Einzige

1 year ago

All of these issues are entirely due to the tokenization scheme. Literally all of them.

You could get this behavior implemented perfectly with constrained text generation techniques such as grammars, or with any of the various libraries that implement constrained generation (e.g. Guidance).

I had briefly looked into Guidance and others (LMQL, Outlines) but I couldn't figure out how to use them for this problem.

I can see how to use them to stop the LLM from writing numbers up to ten as digits: a regex plus a constraint that forbids those digits. The main problem is the other part of the rule, i.e. numbers above 10 should never be spelled out and should be written as digits instead. For that I presume you first need to identify the spelled-out numbers, and that would take the LLM, so you're back to LLM fallibility.
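To make the digit-forbidding half concrete, here is a minimal sketch as a plain regex check on already-generated text. It assumes the rule is that integers 0 through 10 must be spelled out; the names `SMALL_DIGITS` and `find_small_digit_violations` are made up for illustration, and wiring the pattern into Guidance, LMQL, or Outlines as an actual decoding constraint is not shown.

```python
import re

# Assumption: integers 0-10 must be spelled out, so a standalone "0".."10"
# written in digits counts as a violation.
# The lookarounds skip digits that belong to a longer number or a decimal,
# e.g. "42", "100", "3.5", "1,000".
SMALL_DIGITS = re.compile(r"(?<![\d.,])(?:10|[0-9])(?!\d|[.,]\d)")


def find_small_digit_violations(text: str) -> list[str]:
    """Return every standalone 0-10 that appears as digits in `text`."""
    return SMALL_DIGITS.findall(text)


if __name__ == "__main__":
    print(find_small_digit_violations("I have 3 cats and 42 dogs."))
```

This only covers the easy direction. It says nothing about spotting "forty-two" written out in words, which is the part I'm stuck on.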

Any pointers would be greatly appreciated.