Comment by Zigurd

6 days ago

Based on the code that LLMs are good at, and the code that they're terrible at, you are exactly right about LLMs being shaped by their training material. If this is a fundamental limitation, I really don't see general-purpose LLMs progressing beyond their current status as idiot savants. They are confident in the face of not knowing what they don't know.

Your experience with Arabic in particular makes me think there's still a lot of training material to be mined in languages other than English. I suspect the reason the Arabic sounds like it's from 20 years ago is that there's a data-labeling bottleneck in using foreign-language material.

6 comments


parineum 6 days ago

I've had a suspicion for a while that, since a large portion of the Internet is in English and Chinese, any other language would have a much larger share of its training material come from books.

I wouldn't be surprised if Arabic in particular had this issue and if Arabic also had a disproportionate amount of religious text as source material.

I bet you'd see something similar with Hebrew.

mentalgear 6 days ago
I think therein lies another fun benchmark to show that LLMs don't generalize: ask the LLM to solve the same logic riddle, only in different languages. If it can solve it in some languages but not in others, that's a strong argument for straightforward memorization and next-token prediction rather than true generalization capabilities.
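A rough sketch of what that benchmark could look like, in Python. Everything here is an assumption for illustration: `ask_model` stands in for whatever LLM call you'd actually use, `fake_model` is a stub so the snippet runs on its own, and the riddle and its German translation are made up. Scoring is a crude substring match, not a real grader.

```python
from typing import Callable, Dict, List, Tuple

# (prompt, expected_answer) pairs; the same riddle translated into each language.
Riddle = Tuple[str, str]


def per_language_accuracy(
    riddles: Dict[str, List[Riddle]],
    ask_model: Callable[[str], str],
) -> Dict[str, float]:
    """Return the fraction of riddles answered correctly in each language."""
    scores: Dict[str, float] = {}
    for lang, items in riddles.items():
        correct = sum(
            1
            for prompt, expected in items
            # Crude check: the expected answer appears somewhere in the reply.
            if expected.strip().lower() in ask_model(prompt).strip().lower()
        )
        scores[lang] = correct / len(items) if items else 0.0
    return scores


def accuracy_gap(scores: Dict[str, float]) -> float:
    """Difference between the best and worst language; 0.0 means consistent."""
    return max(scores.values()) - min(scores.values())


def fake_model(prompt: str) -> str:
    """Hypothetical stub: only 'solves' the riddle when asked in English."""
    return "Yes" if prompt.startswith("If") else "I don't know"


riddles = {
    "en": [("If all bloops are razzies and all razzies are lazzies, "
            "are all bloops lazzies? Answer yes or no.", "yes")],
    "de": [("Wenn alle Bloops Razzies sind und alle Razzies Lazzies sind, "
            "sind dann alle Bloops Lazzies? Antworte mit ja oder nein.", "ja")],
}

scores = per_language_accuracy(riddles, fake_model)
gap = accuracy_gap(scores)
```

A large gap on riddles the model has plausibly never seen would point toward memorization; with only a handful of riddles per language, though, the gap is mostly noise.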
- zahlman 6 days ago
  
  I would expect that the "classics" have all been thoroughly discussed on the Internet in all major languages by now. But if you could re-train a model from scratch and control its input, there are probably many theories you could test about the model's ability to connect bits of insight together.
Zigurd 6 days ago

While computer languages are different from and significantly simpler than human languages, LLMs as coding agents don't seem fazed by being told to implement in one language based on an example in another. Before they were general-purpose chatbots, LLMs were used in language translation.
eshaham78 6 days ago

[dead]

harrall 6 days ago

Humans are also shaped by the training material… maybe all intelligence is.

Talk to people with extreme views and you realize they are actually rational, but the world they live in is not normal or typical. When you apply perfectly sound logic to a deformed foundation, the output is deformed. Even schizophrenic people are rational… Logic is never the problem, it’s always the training material.

Anyway, that’s why we had to build the mathematical field of statistics and create tools like sample sizes and distributions in order to generalize.