Comment by everforward

5 days ago

This makes me deeply curious about how LLMs understand language. Do LLMs relate cognates more than words that are dissimilar in different languages? I wonder if that plays some role in the effectiveness of tokenization.

I have no idea if the similar spelling will somehow help - I used that mostly because it's a simple way of illustrating the close relationship, but I suspect you'd find that the meanings of closely related words are likely to overlap more directly.

The grammar is perhaps more likely to help. Similar word order etc. Even with weirdness like German - my only top grade on a German essay in school was one where I deliberately ignored what I thought I knew about German and tried to evoke "old-fashioned" Norwegian. The result was guessing at a bunch of grammatical structures that I didn't know were valid German. It turned out I was right about most of them - century-old Norwegian was far closer to century-old Danish, which in turn was a lot closer to valid German, close enough to impress my teacher into overlooking a number of orthographic mistakes.

  • The same thing works for guessing German grammar from English. The farther back you go in English, the more its grammar resembles German.

    "What sayest thou?" -> "Was sagst du?"

    In fact, for the above, you don't even have to know a single German word. You just have to know that for question words, "wh" -> "w", that the English "y" at the end of a syllable usually comes from an older Germanic "g" sound, and that "th" was replaced by "d" in German. That gets you 90% of the way from early modern English to modern German in the above example.
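
    As a rough illustration of the three substitutions above, here is a minimal Python sketch. The rule table and the `germanize` name are made up for this example, the rules work on spelling rather than on sounds, and "y after a vowel" is only a crude stand-in for "y at the end of a syllable".

```python
import re

# Hypothetical rule table: the three correspondences from the comment,
# applied as naive spelling rewrites (an assumption, not real phonology).
RULES = [
    (r"\bwh", "w"),           # question words: "wh" -> "w"
    (r"(?<=[aeiou])y", "g"),  # "y" after a vowel -> older Germanic "g"
    (r"th", "d"),             # "th" -> "d"
]

def germanize(text):
    """Lowercase the text, then apply each spelling rule in order."""
    out = text.lower()
    for pattern, replacement in RULES:
        out = re.sub(pattern, replacement, out)
    return out

print(germanize("What sayest thou?"))  # prints: wat sagest dou?
```

    The output, "wat sagest dou?", is recognisably close to "Was sagst du?" but not identical: the final "t" vs "s" and the vowels are not covered by these three rules.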

    • That's interesting. I haven't thought about it in that direction before. I'm "of course" aware of the High German consonant shift, which also muddled things a lot (the continuum around the North Sea is a lot "cleaner" if you look at Plattdeutsch instead), but I never thought much about what other simple transformations could be applied to standard modern German.