Comment by kvam

5 days ago

As a Norwegian this sounds like a mistake. Who will use this LLM? Where? For what? The underlying data could be made more easily searchable and digestible for agents in general if the goal is better knowledge of Norwegian culture.

I agree in principle.

That said, they are quite limited in what they are allowed to share of in-copyright works, and nb.no is a fantastic resource as it is (though you'll need a Norwegian IP address for too much of it - it's one of the main reasons I maintain a VPN) - if they are allowed to make it accessible there, it'd be great.

But they also have vast amounts of out-of-copyright data that I hope they'd make more easily accessible...

Hard disagree. This is the first step, not the last, and it proves to other countries that this can be done.

  • This model is going to start miles behind the frontier and the gap will only grow.

    • Why would the gap grow? There is no more training data to acquire; frontier models are training on the entire internet. Everything from now on is just fine-tuning.

Exactly, if there's one thing transformers are good at, it's translation. One thing I've found particularly nice: any question ChatGPT can answer in English, it can answer in French. I'm assuming Norwegian too. So there's no point.

  • There's quite a bit more to culture and language than just being able to have transformers come up with believable language and/or dialect.

  • The point is that Norway will have its own LLM and will not depend on another state or private company. The goal is not to be the best model, but to have a model that includes more Norwegian data than other LLMs and that is not skewed toward other sources.

    • But what does that give you? If the model is far less capable? What will it do for you with that Norwegian data, that a better model could not do with better search or context?

  • There is a lot more to it than literal translations. Even if an American can speak Norwegian, it doesn't mean that they get the cultural context right.

    "Oh yeah after you drive 2 hours home from work your wife and kid will greet you with some delicious pie" doesn't work so well even if it's in Norwegian.

  • Yes, transformers are great at translation, as that is their purpose.

    LLMs are not great at preserving cultural uniqueness and diversity. Take how “delve” has reentered the lexicon because the human assessors used in training speak a dialect of English that uses “delve” a lot.

    There are a lot of benefits to training specifically for a unique culture with unique norms, to preserve that culture as we increasingly rely on LLMs.

    https://www.scientificamerican.com/article/chatgpt-is-changi...

  • The model can speak Lithuanian too, but with a Russian accent, which is a big taboo for us.

    • I wonder, can you ask it not to do that first, and check if that still happens?

  • They're only good at it because they were trained on massive amounts of English and French data.

    • Not really true.

      Both Claude and ChatGPT can translate into minor dialects of Norwegian that they will have seen very few works in, because very few printed works exist in them.

      E.g. I've tested both my local spoken dialect, which is rarely written, and a sociolect used by a 1970s Maoist group consisting of a few hundred people, where most of the printed material consists of novels from a couple of ex-members who became authors.

      In the latter case, it claimed to not know, but was able to get a good match from just a description.

      I also just had it ape Norwegian orthography from the 1910s by having it look up the rules and translate a text it had first translated from English to modern Norwegian, and it did just fine.

      They will have seen some work in these dialects, but mostly, knowing related languages transfers really well (English, Dutch, German, Swedish, Danish, roughly form a continuum from least in common to most in common with modern Norwegian; they all share vocabulary and significant parts of grammar with Norwegian), and then a relatively limited exposure to Norwegian itself is sufficient to do fairly well.

      They're also really good at "style transfer" of text in the form of tweaking orthography, word order, and minor grammar changes from descriptions and examples.

      (incidentally, the latter is one way of getting an LLM to sound a lot less like an LLM)
