Comment by bonzini

2 days ago

I think it's plausible that different languages would prefer different tokenizations. For example, in Spanish the plural of carro is carros, while in Italian it's carri. Maybe the LLM would prefer carr+o in Italian and a single token in Spanish.

Certainly! What surprised me was that LLMs are apparently designed to allow multiple ways of encoding the same string as tokens. I assumed this would lead to inefficiency, because training would then not know whether to favour outputting, say, se|same or ses|ame after "open", and would throw some weight on each. But provided there's a deterministic rule, like "always choose the longest matching token", that uncertainty goes away.
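
To make the "longest matching token" rule concrete, here is a minimal sketch; the toy vocabulary and the function are invented for illustration, and real tokenizers (BPE and friends) use learned merge rules rather than this exact scheme, but the point stands: a deterministic rule picks exactly one of the possible encodings.

```python
# Toy vocabulary, invented for illustration only.
VOCAB = {"open", "se", "ses", "same", "ame", "sesame"}

def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split `text` by always taking the longest vocabulary entry
    that matches at the current position (fully deterministic)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, shrinking until one fits.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Fall back to a single character if nothing matches.
            tokens.append(text[i])
            i += 1
    return tokens

print(greedy_tokenize("opensesame", VOCAB))
# ['open', 'sesame']
print(greedy_tokenize("opensesame", VOCAB - {"sesame"}))
# ['open', 'ses', 'ame']  (greedy always picks ses|ame, never se|same)
```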

  • LLMs are probabilistic black boxes; trying to inject determinism into their natural-language processing (as opposed to, e.g., forcing a grammar on the output) may very well screw them over completely.

    • LLMs are ultimately just matrix multiplication and some other maths; nothing about them is inherently nondeterministic. When nondeterminism is present, it's because it was deliberately sprinkled on top (it tends to produce better results). A small sketch below illustrates the distinction.

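
Here is a toy sketch of that distinction; the logits are made up and real decoding stacks are more elaborate, but the forward pass reduces to deterministic arithmetic, and randomness only appears if you sample from the resulting distribution instead of taking the argmax.

```python
import numpy as np

rng = np.random.default_rng()

# Made-up scores for four hypothetical next tokens.
logits = np.array([2.0, 1.5, 0.3, -1.0])

def greedy_pick(logits):
    # Pure argmax: same input, same output, every run.
    return int(np.argmax(logits))

def sampled_pick(logits, temperature=1.0):
    # Softmax over temperature-scaled logits, then a random draw.
    # This draw is where the nondeterminism is deliberately added.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

print(greedy_pick(logits))                        # always 0
print([sampled_pick(logits) for _ in range(5)])   # varies from run to run
```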