Comment by dist-epoch
18 hours ago
If it's a new pretrain, the token embeddings could be wider - you can pack more info into each token making its way through the system.
Like Chinese versus English - you need fewer Chinese characters to say something than if you write that in English.
So this model internally could be thinking in much more expressive embeddings.
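For a concrete picture of what "wider embeddings" means mechanically, here's a minimal PyTorch sketch (the vocabulary size and widths are made-up numbers, not from any particular model): each token id maps to a single vector, and a wider embedding table simply gives every token a longer vector to carry information through the network.

```python
import torch.nn as nn

vocab_size = 50_000                 # hypothetical vocabulary size
narrow_dim, wide_dim = 4096, 8192   # hypothetical embedding widths

narrow = nn.Embedding(vocab_size, narrow_dim)
wide = nn.Embedding(vocab_size, wide_dim)

# Same vocabulary, but each token in the wider table gets twice as many
# dimensions to represent whatever the model packs into it.
print(narrow.weight.shape)  # torch.Size([50000, 4096])
print(wide.weight.shape)    # torch.Size([50000, 8192])
```

The analogy in the comment is that a wider per-token vector is like a denser writing system: the same amount of meaning fits into fewer, richer units.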