Comment by akoboldfrying

2 days ago

Certainly! What surprised me was that LLM tokenizers apparently deliberately allow multiple ways of encoding the same string as tokens. I had assumed this would lead to inefficiency, since training wouldn't know whether it should favour outputting, say, se|same or ses|ame after "open", and would therefore throw some weight on each. But provided there's a deterministic rule, like "always choose the longest matching token", that uncertainty goes away.
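
Here's a toy Python sketch of what I mean, using a made-up vocabulary (real tokenizers like BPE are built from learned merge rules, so this is only an illustration): the same string admits several valid encodings, but a greedy longest-match rule always yields exactly one.

```python
# Toy vocabulary chosen to make "sesame" ambiguous; not from any real tokenizer.
VOCAB = {"s", "e", "a", "m", "se", "ses", "same", "ame", "sesame"}

def longest_match_tokenize(text, vocab):
    """Deterministic greedy rule: at each position, take the longest vocab
    entry that matches. Every string gets exactly one encoding."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):  # try longest piece first
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

def all_tokenizations(text, vocab):
    """Enumerate every way to split the string into vocab entries,
    to show the ambiguity the greedy rule removes."""
    if not text:
        return [[]]
    results = []
    for length in range(1, len(text) + 1):
        piece = text[:length]
        if piece in vocab:
            for rest in all_tokenizations(text[length:], vocab):
                results.append([piece] + rest)
    return results

print(all_tokenizations("sesame", VOCAB))       # includes ['se', 'same'], ['ses', 'ame'], ...
print(longest_match_tokenize("sesame", VOCAB))  # always ['sesame']
```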

LLMs are probabilistic black boxes; trying to inject determinism into their natural-language processing (as opposed to, e.g., forcing a grammar on the output) may very well screw them over completely.

  • LLMs are ultimately just matrix multiplication and some other maths; nothing about them is inherently nondeterministic. When nondeterminism is present, it's because it was deliberately sprinkled on top (sampling tends to produce better results than always picking the most likely token).

    • Yes, "determinism" is not the best word. What I mean is that if you force the LLM to output "carr"+"o" even when it prefers "carro", this could result in worse-quality output (see the sketch below).
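
A toy sketch of that worry, with made-up probabilities (nothing here comes from a real model): greedy argmax decoding is fully deterministic, the nondeterminism people see is sampling added on top of it, and forcing an unpreferred split like "carr"+"o" pushes the model through a token it considers unlikely.

```python
import math
import random

# Hypothetical next-token distribution the model might assign after some prefix;
# the numbers are invented purely to illustrate the point.
next_token_probs = {"carro": 0.60, "car": 0.20, "ca": 0.15, "carr": 0.05}

# 1) Nothing forces an LLM to be nondeterministic: greedy (argmax) decoding
#    always picks the same token for the same input.
greedy = max(next_token_probs, key=next_token_probs.get)
print(greedy)  # 'carro', every time

# 2) Nondeterminism is added on top by sampling from the distribution
#    (usually with a temperature), because it tends to give better text.
sampled = random.choices(list(next_token_probs), weights=next_token_probs.values())[0]
print(sampled)  # varies from run to run

# 3) Forcing the tokenization "carr" + "o" makes the model emit a token it
#    considers unlikely and then continue from a state it rarely saw in
#    training -- that's the "worse quality output" concern.
print(math.log(next_token_probs["carr"]))   # log P of the forced first piece (0.05)
print(math.log(next_token_probs["carro"]))  # log P of the model's preferred token (0.60)
```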