Comment by parineum
7 days ago
I've had a suspicion for a bit that, since a large portion of the Internet is in English and Chinese, other languages draw a much larger share of their training material from books.
I wouldn't be surprised if Arabic in particular had this issue and if Arabic also had a disproportionate amount of religious text as source material.
I bet you'd see something similar with Hebrew.
I think therein lies another fun benchmark to show that LLMs don't generalize: ask the LLM to solve the same logic riddle in different languages. If it can solve it in some languages but not in others, that's a strong argument for straightforward memorization and next-token prediction rather than true generalization.
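A minimal sketch of that benchmark idea: translate one riddle into several languages, ask the model each version, and compare correctness across languages. `ask_model` is a placeholder to be wired to whatever LLM API you use; here it is stubbed to fail on non-English prompts purely to illustrate what a memorization-style gap would look like. The riddle text and language codes are invented for the example.

```python
def ask_model(prompt: str) -> str:
    # Placeholder: replace with a real LLM API call.
    # This stub "solves" only the English prompt, simulating the
    # language-dependent failure the benchmark is meant to detect.
    return "42" if prompt.startswith("EN:") else "no idea"

def cross_lingual_consistency(translations: dict, expected: str) -> dict:
    """Ask the same riddle in every language; record per-language correctness."""
    return {lang: ask_model(prompt).strip() == expected
            for lang, prompt in translations.items()}

# Hypothetical riddle, with the non-English versions elided for brevity.
riddle = {
    "en": "EN: If all Bloops are Razzies and all Razzies are Lazzies, "
          "are all Bloops Lazzies? Reply 42 if yes.",
    "ar": "AR: (the same riddle, translated into Arabic)",
    "he": "HE: (the same riddle, translated into Hebrew)",
}

results = cross_lingual_consistency(riddle, expected="42")
print(results)
```

A model that truly generalizes should score roughly uniformly across languages; large per-language gaps would suggest it memorized language-specific training text rather than the underlying logic.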
I would expect that the "classics" have all been thoroughly discussed on the Internet in all major languages by now. But if you could re-train a model from scratch and control its input, there are probably many theories you could test about the model's ability to connect bits of insight together.
While computer languages are different from and significantly simpler than human languages, LLMs as coding agents don't seem fazed by being told to implement in one language based on an example in another. Before they were general-purpose chat bots, LLMs were used in language translation.