Comment by vidarh
5 days ago
> English and Norwegian are not closely related language families, perhaps LoRA is not best approach?
Yes, they are. English is a West Germanic language. Norwegian is a North Germanic language. The French vocabulary in English obscures it a bit, but the two languages have similar grammar and the vocabulary has a huge number of close cognates.
E.g. day -> dag, ship -> skip, apple -> eple, cow -> ku (which makes more sense when you pronounce them correctly out loud), bairn (child; mostly Scotland and Northern England) -> barn, hop -> hopp, yule -> jul, just to give a random selection of Germanic words in English.
But more than that, the frontier models both a) know Norwegian quite well and b) certainly know German and Dutch well, and there's a continuum of language transfer around the North Sea, especially when accounting for sounds rather than modern orthography. To take a couple of examples from above: ship -> schip -> Schiff -> skib -> skip; day -> dag -> Tag -> dag. The "jump" to Dutch already weeds out most of the French. A lot of modern Norwegian orthography comes from Danish, which again shares more than modern Norwegian does with German.
Knowing any of these helps a lot with learning Norwegian and vice versa. E.g. I'm Norwegian, I've never learnt Dutch, but I have learnt English and German, and I can read Dutch fairly well from that alone.
This makes me deeply curious about how LLMs understand language. Do LLMs relate cognates more than words that are dissimilar in different languages? I wonder if that plays some role in the effectiveness of tokenization.
I have no idea if the similar spelling will somehow help - I used that mostly because it's a simple way of illustrating the close relationship, but I suspect you'd find that the meanings of closely related words are likely to overlap more directly.
The grammar is perhaps more likely to help: similar word order, etc. Even weirdness like German - my only top grade on a German essay in school was one where I on purpose ignored what I thought I knew about German and tried to evoke "old-fashioned" Norwegian. The result was guessing at a bunch of grammatical structures that I didn't know were valid German. Turned out I was right about most of it - century-old Norwegian was far closer to century-old Danish, which was a lot closer to valid German, and close enough to impress my teacher into overlooking a number of orthographic mistakes.
The same thing works for guessing German grammar from English. The farther back you go in English, the more its grammar resembles German.
"What sayest thou?" -> "Was sagst du?"
In fact, for the above, you don't even have to know a single German word. You just have to know that for question words "wh" -> "w", that the English "y" at the end of a syllable usually comes from an older Germanic "g" sound, and that "th" was replaced by "d" in German. That gets you 90% of the way from early modern English to modern German in the above example.
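A rough sketch of those three rules in Python, just to make the mechanics concrete. The function name and the regexes are illustrative assumptions, and "y directly after a vowel" is only a crude stand-in for "y at the end of a syllable"; this is a toy, not a real model of sound change.

```python
import re

# Hypothetical, deliberately crude encoding of the three rules above.
# Each entry is (regex pattern, replacement), applied in order.
RULES = [
    (r"\bwh", "w"),            # question words: "wh" -> "w"
    (r"(?<=[aeiou])y", "g"),   # assumption: "y" after a vowel ~ syllable-final "y" -> "g"
    (r"th", "d"),              # "th" -> "d"
]

def apply_sound_rules(text: str) -> str:
    """Lowercase the text and apply each substitution rule in sequence."""
    result = text.lower()
    for pattern, replacement in RULES:
        result = re.sub(pattern, replacement, result)
    return result

print(apply_sound_rules("What sayest thou?"))  # wat sagest dou?
print(apply_sound_rules("day"))                # dag
```

The output "wat sagest dou?" is not yet "Was sagst du?", but the skeleton is recognisable; the remaining differences are not covered by these three rules.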