Comment by cyberge99

1 day ago

> Forgive me if this is a naive assumption, but wouldn’t large language models be fundamentally different for a language that is largely symbols? Again, my understanding of Mandarin is limited if it exists at all.

All tokens are symbols. All of the frontier models speak Mandarin.

"飞机" and "airplane" aren't fundamentally different in terms of how they're represented to a computer. Especially for an LLM, where tokenization likely turns each of those into a single token.