Comment by bonzini

2 days ago

I think it's plausible that different languages would prefer different tokenizations. For example, in Spanish the plural of carro is carros, while in Italian it's carri. Maybe the LLM would prefer carr+o in Italian and a single token in Spanish.

Certainly! What surprised me was that LLMs are apparently designed to allow multiple ways of encoding the same string as tokens. I assumed this would lead to inefficiency, because training would then not know whether to favour outputting, say, se|same or ses|ame after "open", and would throw some weight on each. But provided there's a deterministic rule, like "always choose the longest matching token", that uncertainty goes away.
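
To make the "longest matching token" rule concrete, here is a minimal sketch; the toy vocabulary and the function are invented for illustration, and real tokenizers (BPE and friends) use learned merge rules rather than this exact scheme, but the point stands: a deterministic rule picks exactly one of the possible encodings.

```python
# Toy vocabulary, invented for illustration only.
VOCAB = {"open", "se", "ses", "same", "ame", "sesame"}

def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split `text` by always taking the longest vocabulary entry
    that matches at the current position (fully deterministic)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, shrinking until one fits.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Fall back to a single character if nothing matches.
            tokens.append(text[i])
            i += 1
    return tokens

print(greedy_tokenize("opensesame", VOCAB))
# ['open', 'sesame']
print(greedy_tokenize("opensesame", VOCAB - {"sesame"}))
# ['open', 'ses', 'ame']  (greedy always picks ses|ame, never se|same)
```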

  • LLMs are probabilistic black boxes; trying to inject determinism into their natural-language processing (as opposed to, e.g., forcing a grammar on the output) may very well screw them over completely.

    • LLMs are ultimately just matrix multiplication and some other maths; nothing about them is inherently nondeterministic. When nondeterminism is present, it's because it was deliberately sprinkled on top (it tends to produce better results). A small sketch below illustrates the distinction.

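
Here is a toy sketch of that distinction; the logits are made up and real decoding stacks are more elaborate, but the forward pass reduces to deterministic arithmetic, and randomness only appears if you sample from the resulting distribution instead of taking the argmax.

```python
import numpy as np

rng = np.random.default_rng()

# Made-up scores for four hypothetical next tokens.
logits = np.array([2.0, 1.5, 0.3, -1.0])

def greedy_pick(logits):
    # Pure argmax: same input, same output, every run.
    return int(np.argmax(logits))

def sampled_pick(logits, temperature=1.0):
    # Softmax over temperature-scaled logits, then a random draw.
    # This draw is where the nondeterminism is deliberately added.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

print(greedy_pick(logits))                        # always 0
print([sampled_pick(logits) for _ in range(5)])   # varies from run to run
```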