Comment by rldjbpin

4 days ago

may not be the most efficient way to go about things, but there remains a seemingly obvious use case for non-latin languages to do things from scratch.

see sarvam.ai and their tokenisation improvements on local languages [1]. not every llm needs to help with coding, nor does it need to be a Babel fish already.

language is culture, so i can see the motivation behind their initiative. it must be nice to afford to do this yourself.

[1] https://www.sarvam.ai/blogs/sarvam-30b-105b

2 comments

kgeist 4 days ago

>but there remains a seemingly obvious use case for non-latin languages to do things from scratch

>see sarvam.ai and their tokenisation improvements on local languages

You don't need to build from scratch to improve tokenization, though.

Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen tokenizer to include 5 times more Cyrillic tokens (plus post-training on a Russian corpus).
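
To illustrate why more script-specific tokens mean fewer tokens per sentence (and so fewer decoding steps), here is a toy sketch. It is not T-Bank's method or Qwen's real tokenizer: the greedy longest-match `tokenize` function, the byte fallback format, and both vocabularies are made up for illustration.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer (toy). Characters that no vocab
    entry covers fall back to one token per UTF-8 byte."""
    max_len = max((len(t) for t in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Byte fallback: each Cyrillic letter is 2 bytes in UTF-8.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: a Latin-centric base, then the same base
# extended with a few Cyrillic pieces.
base_vocab = {"hello", " ", "world"}
extended_vocab = base_vocab | {"при", "вет", "мир"}

text = "привет мир"
base_tokens = tokenize(text, base_vocab)
ext_tokens = tokenize(text, extended_vocab)
print(len(base_tokens), len(ext_tokens))  # 19 4
```

In a real model the new tokens also need embeddings that the model has learned to use, which is where the post-training step comes in.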

rldjbpin 4 days ago

the improvement for sarvam was in the number of tokens used to represent words in non-english languages compared to english.
the great thing about the current momentum is that someone can test this hypothesis by applying the T-Bank approach to the same set of languages and comparing outcomes.
unfortunately not everyone has that level of compute so easily available, at least not those outside the ZIRP/VC ecosystem of the valley.
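
measuring the tokens-per-word number itself is cheap, at least. a rough sketch of such a comparison, assuming a plain UTF-8 byte tokenizer as a stand-in for a real one (`tokens_per_word`, `byte_tokenize` and the sample sentences are made up for illustration, not taken from sarvam or T-Bank):

```python
def tokens_per_word(text, tokenize):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("empty sample")
    return sum(len(tokenize(w)) for w in words) / len(words)


def byte_tokenize(word):
    # Stand-in tokenizer (assumption): one token per UTF-8 byte.
    # Swap in any real tokenizer's encode function to compare models.
    return list(word.encode("utf-8"))


# Illustrative sample sentences, not a benchmark corpus.
samples = {
    "english": "language is culture",
    "hindi": "भाषा संस्कृति है",
}

for name, text in samples.items():
    print(name, round(tokens_per_word(text, byte_tokenize), 2))
```

running the same function with two different tokenizers over the same text gives a before/after figure without training anything.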