Comment by dragonwriter
14 days ago
> The model wasn't trained on those languages (yet).
It probably has been trained on them (it was trained on 40 trillion tokens covering 200 languages, and they almost certainly didn't avoid CJK languages).
It has only been further fine-tuned on a set of 12 languages. (I wonder if that is the set the base Behemoth model, from which both were distilled, had been trained on at the time of distillation; Behemoth is apparently not completely finished, so perhaps there will be further revisions of the distilled models as it is completed.)