Comment by spacebanana7
5 days ago
I suspect this also happens in programming languages. Subjectively I get the feeling that LLMs prefer to write in Python or JS.
Would be interesting to see whether they actually score better on LeetCode questions when using Python.
See my other comment. The answer is transfer learning: leverage massive amounts of data in one language like Python, plus a few bridges to another language like Ruby, and obtain a “native” result in the other language.
In this case, though, the LLM is not exposed to explicit translation pairs between the two languages. Rather, by seeing enough examples in similar contexts, it transfers some of what it learned in Python to Ruby (for better or worse).
Based on my very very limited understanding of how LLMs work, surely they don't "prefer" anything, and just use what they have been trained on?
Presumably there is a lot more public info about, and code in, JavaScript and Python, hence this "preference".
Maybe the LLM preferring English is because of a similar phenomenon - it has been trained mostly on the Western, English-speaking internet?
There are likely some languages that are genuinely easier or more difficult for LLMs.
For example, consider Pascal or C89, which require variables to be declared up front (in a var section before the body in Pascal, at the top of a block in C89). That makes it much harder to generate code in a linear fashion. In Python you can just make up a variable the moment you decide you need it. In Pascal or C89 you would have to go back and change previous code, which LLMs can't easily do because they emit tokens left to right.
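To make the contrast concrete, here is a small Python sketch. The function and variable names are made up for illustration, and the Pascal fragment in the comments is only there to show the shape of an up-front declaration.

```python
def mean_and_spread(values):
    # Each name is introduced at the point where it is first needed.
    total = sum(values)
    mean = total / len(values)
    # A later idea (the spread) needs a new variable; nothing above changes.
    spread = max(values) - min(values)
    return mean, spread

# A Pascal version would have to list every variable before the body, e.g.
#   var total, mean, spread: real;
# so a strictly left-to-right writer must commit to `spread` before it has
# written the line that turns out to need it.
```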
Similar things likely apply to strict typing. Typing makes it easier to reason about existing code, but it makes it harder to write new code if you don't have the ability to go back and change your mind on a type choice.
Both could be mitigated if we selected tokens with beam search, keeping several candidate sequences and choosing the one with the highest combined token probability instead of greedily selecting one token at a time. But that's much more expensive and I'm not sure anyone still does that with large-scale LLMs.
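A toy sketch of the greedy vs. beam search difference, in Python. `TOY_MODEL` and its probabilities are invented for illustration (a real model would supply the next-token distribution); the point is only that the locally best first token can lead to a lower-probability sequence overall.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the prefix so far.
TOY_MODEL = {
    (): {"x": 0.6, "var": 0.4},
    ("x",): {"= 1": 0.5, "= 2": 0.5},
    ("var",): {"x: int": 0.9, "y: int": 0.1},
}

def next_probs(prefix):
    """Return the next-token distribution for a prefix (empty if finished)."""
    return TOY_MODEL.get(tuple(prefix), {})

def greedy(steps):
    """Pick the single most probable token at every step."""
    seq, logp = [], 0.0
    for _ in range(steps):
        probs = next_probs(seq)
        if not probs:
            break
        tok = max(probs, key=probs.get)
        seq.append(tok)
        logp += math.log(probs[tok])
    return seq, logp

def beam_search(steps, width):
    """Keep the `width` best partial sequences by summed log-probability."""
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            probs = next_probs(seq)
            if not probs:
                candidates.append((seq, logp))  # finished sequence, carry over
                continue
            for tok, p in probs.items():
                candidates.append((seq + [tok], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]
```

Here `greedy(2)` commits to `"x"` (0.6) and ends with total probability 0.3, while `beam_search(2, 2)` keeps `"var"` alive and finds `["var", "x: int"]` at 0.36. The cost is visible too: each step expands every beam, so the work grows roughly in proportion to the beam width.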
You could ask the LLM to first work out the solution in pseudocode, then translate to Pascal (or whatever). That way the variables are known after the initial pseudocode pass.
Human programmers also did this more often in those days than is probably the case now.
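The pseudocode-first idea could be sketched in Python roughly like this. Everything here is an assumption for illustration: `llm` stands for any function that takes a prompt string and returns text, and the prompt wording is made up, not taken from any particular API.

```python
from typing import Callable

def pseudocode_then_translate(
    task: str,
    target_language: str,
    llm: Callable[[str], str],  # hypothetical: any prompt -> text function
) -> str:
    # Pass 1: work out the solution without declaration-order constraints.
    plan = llm(
        "Write pseudocode for this task and list every variable it uses:\n"
        + task
    )
    # Pass 2: the variables are now known, so they can be declared up front.
    return llm(
        f"Translate this pseudocode into {target_language}. "
        "Declare all variables before the first statement.\n" + plan
    )
```

The second call sees the whole plan before it writes a single declaration, which is the property the single-pass approach lacks.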
> Presumably there is a lot more public info about, and code in, JavaScript and Python, hence this "preference"
This likely plays a major - probably dominant - role.
It's interesting to think of other factors too, though. The relatively concise syntax of those languages might make them easier for LLMs to work with. If resources are in any way token-limited, then reading and writing Spring Boot apps is going to be burdensome.
Those languages also have a lot of single-file applications, which might make them easier for LLMs to learn. So much of iOS development, for example, is split across many files, and I wonder if that affects the quality of the training data.
Also worth considering: there's a wider range of "acceptable" output programs when dealing with such forgiving scripting languages. If asked to output C, there are loads of finicky bits it could mess up: pointer accesses, writing past the end of an array, using uninitialized memory, using a value it already freed, missing a free, etc. These are all things that the language runtime handles in Python or JS, so C puts a higher cognitive load on the model.
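As a small illustration of the runtime doing that work, here is a Python sketch of an out-of-range write (the helper name is made up). The equivalent C statement on a 3-element array, `buf[3] = 1;`, is undefined behavior and may silently corrupt memory.

```python
def try_write(buffer, index, value):
    """Attempt a write; report whether the runtime rejected it."""
    try:
        buffer[index] = value
        return "written"
    except IndexError:
        # Python checks bounds on every list access and raises
        # instead of touching memory outside the list.
        return "IndexError"

buf = [0, 0, 0]
in_range = try_write(buf, 2, 1)   # last valid index
past_end = try_write(buf, 3, 1)   # one past the end
```

The mistake still exists in the Python version, but it surfaces as a catchable error rather than as a program that appears to work.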