Comment by KeplerBoy
5 days ago
How true is this statement: "He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language."
I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.
If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available? Why go to the expense of training your own model, especially when it will be inferior to state-of-the-art models?
I task GPT/Claude with researching stuff that pertains to very specific cultural or legal aspects in French politics, on a daily basis. Even though French is a way more common language globally than Norwegian, these models still haven't figured out that, no matter the language I myself speak to them (German or English depending on my mood) their web searches need to be done in French to return reasonable results. I have to remind them every time lest they come back with "uh, didn't find anything relevant, here take some hallucinations instead."
So, given the anglo-centrism of current models, my confidence in American providers giving any shits about non-American users/use-cases is pretty low. And lower the smaller the language community is.
I've noticed that it also imposes American moral judgements on certain things, even though it reasons (sometimes) in the native language.
I was trying to work out how and when to use swear words, and the relative power index of them. It translated English swear words into the target language, then lectured me on not using them.
It took a bunch of prodding for it to actually think as the target language to then get the (mostly) correct response.
Aren’t you already using English in the LLM convo? Telling the model to use French for research or to find resources in French seems like a reasonable step.
If you’re doing this on a daily basis, then you should have an AGENTS.md that accumulates directional instructions like this.
This is how you use the tool correctly.
There’s this weird pattern I’ve noticed where people expect LLMs to require zero effort or proficiency on their part, and when the LLM isn’t perfect without it, the verdict is "of course it wasn’t; LLMs suck."
If you ask in French, it searches in French, right?
I have the opposite problem, where I'll ask in English, about something in a foreign country, the results it finds will all be in that foreign language, and the LLM will switch languages and respond in that language (which I don't speak).
So then I have to ask it "can you repeat that in English please."
I keep waiting for the new GPT-Definitely-AGI-For-Real-This-Time to fix it but it's still there.
> their web searches need to be done in French to return reasonable results.
I wonder how much of this is also just the search engine's region setting.
It's a big problem I regularly have with Google. I almost always want English language, US-centric results, so I have my region set to the US. But occasionally I want results relevant to my actual country, and even searching in my native language usually yields much worse results than just opening an incognito tab and letting it default to my real location.
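To make the region/language point concrete, here is a minimal sketch of picking search settings from the country a question is about rather than from the language the user happens to type in. The parameter names `gl`, `hl` and `lr` follow Google's documented Custom Search query parameters; the `search.example` endpoint, the `TOPIC_LOCALES` table and `build_search_url` are hypothetical names made up for illustration, and whether any given LLM search tool exposes such settings is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical table: country the question is about -> search settings.
# Values are illustrative, not an exhaustive or authoritative list.
TOPIC_LOCALES = {
    "france": {"gl": "fr", "hl": "fr", "lr": "lang_fr"},
    "norway": {"gl": "no", "hl": "no", "lr": "lang_no"},
    "germany": {"gl": "de", "hl": "de", "lr": "lang_de"},
}


def build_search_url(query, topic_country, base="https://search.example/search"):
    """Build a search URL localized to the topic's country.

    `base` is a placeholder endpoint, not a real service.
    Unknown countries fall back to an unlocalized query.
    """
    params = {"q": query}
    params.update(TOPIC_LOCALES.get(topic_country.lower(), {}))
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    # The query is in French and the settings are French, regardless of
    # whether the user chats in German or English.
    print(build_search_url("loi retraites 2023", "France"))
```

The point of the sketch is only that the locale is keyed on the topic, not on the conversation language, which is the behavior the comments above are asking for.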
Have you tried asking it to translate the prompt to French, and then feeding it the translated prompt?
I have the opposite problem. I often have to ask ChatGPT about things related to Norway and I have to constantly correct it when it keeps switching to responding in Norwegian no matter how many times I tell it to only answer in Norwegian when I request it.
What incentives does OpenAI have to make sure the AI actually works well with Norwegian beyond capturing a (small) Norwegian market? What incentives do they have to take Norwegian values into consideration, or to preserve Norwegian culture into the future? The matter is also a question of national sovereignty, so to simply release the data and nicely ask foreign companies to solve the problem for you would be a fool's move.
It's also a bit funny because Norway definitely has enough money to hire a team of Anthropic's best to go out there and train them a model that does whatever they want. They probably have enough money to fund their own Anthropic competitor.
Yeah, was about to comment that too. Instead of training a new model and new weights exclusively for Norwegian (and expecting/wanting every other small/medium-sized country to do the same), which seems infinitely harder, they could have made high quality transcriptions and translations of the stories currently described only in Norwegian into English, and made it all public. I guess there still would be a worry that it'd be counted as "less important" compared to other history, news and culture about other countries.
Oddly enough, my wife was recently involved in a project to translate historical crime novels from Norwegian; since all the available late 20th century Scandinavian crime novels have already been translated and turned into popular TV series, the plan was to go further back. Into the 1930s. The first cut was done with LLMs, but encountered the problem that (a) Norwegian itself has changed noticeably since then, in both major dialects, and (b) the machine translation deteriorated on large sections, resulting in entirely missing paragraphs and pages in a few places. Not to mention the usual translation issues (what police role does lensman map to?) and localisation (to what extent should the casual antisemitism be left in or removed?)
Translation is never a bijective process. It's never quite the same experience in translation as it is in the original, due to the cultural differences between reader and writer. Larger in this case because 1930s Norway is very different even from 2020s Norway.
Ultimately this was not a success due to marketing difficulties; it is very difficult to get a book noticed.
( https://www.amazon.co.uk/Iron-Chariot-Nordic-Crime-Library/d... )
> high quality transcriptions and translations of the stories currently described only in Norwegian into English
You make it sound like an easier task than training an LLM. I'd argue it's not obvious, and would assume the contrary.
Copyrights and statutes don't allow them to do that. The mandate of the National Library maybe permits them to make an LLM though (though I won't at all be surprised if someone sues them anyway).
absolutely. somebody online was wanting an LLM with Georgian language support, and that's exactly what i suggested: start digitizing Georgian text.
> wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available?
Only if you believe other people will value that enough to expend the effort necessary to use it. If you believe other people will see it as low value and ignore it then you'd be better off doing the training yourself in order to guarantee it happens.
There's also a secondary benefit that your team doing the work will learn some useful skills while they do it.
Because state of the art models are owned and controlled by foreign agents.
Because you have so much money you don’t know what to do with it any more.
Permissions, probably. Copyrights and statutes. Knowing the librarians, unfortunately the prestige of their job is more vested in denying you access than giving you access.
I mean it's their job to give people access to information, and they certainly do, but the mark of a professional, in their eyes, is guarding information. It's much more embarrassing for them professionally to give too much access than too little.
LLM training gives them a "respectable" way of bypassing that and giving the world their information (which, in fairness, they probably all really want to do if they could).
If they wanted to, they all have scanners and access to information on how to create torrents. Setting the information free isn't complicated, so it'd seem most of them do not want to.
> Why go to the expense of training your own model, especially when it will be inferior to state of the art models.
Uuh.. No? Especially if the training data, as in this case, is of better quality.
> Why go to the expense...
Answer: idiocy of decision makers and the desire to get resources by those who created the proposal.
I assumed Scandinavia has better decision processes but apparently I was wrong.
Not remotely true in my estimation. I don't really speak Norwegian, but I do speak Swedish (which means I mostly understand Norwegian as they're very similar). Every model I've tried speaking Swedish to does it perfectly. I'd be surprised if the same isn't true for Norwegian already.
Of course they speak Swedish. But often, they do not reason in Swedish and do not search in Swedish. Swedish makes up a tiny fraction of training data, while the vast majority is English, from the US. Which means the answers will always have a bias towards US culture, even if you ask in Swedish and the LLM answers in Swedish.
While Google does a good job with language support in their models, GPT-5.5 can't write proper Norwegian. It's even making up words that do not exist.
Different models have been very different in this way.. almost ten years ago the French made a very large effort to capture languages.. the release notes I read at the time, IIRC, had quite a few languages from South Asia / India, and in Africa. The language that was prominently missing was German, IIRC. I cannot say for the 2025-2026 models since so much has happened.. but models are not equal.
Does that include local distilled models? Because it didn't last time I checked for Norwegian.
Not really. For instance Facebook speech recognition models had Swedish support but no Norwegian.
Foreign LLMs are probably not trained on the Norwegian National Library. I regularly find things in there (with regular keyword search, for genealogy) which neither search engines nor language models know.
Of course I then usually put the information I'm interested in somewhere AI could scrape it. But it would take a long, long time to get everything interesting out of there.
Yep in the article it says ..the National Library .. has the single largest digital collection of Norwegian books, newspapers, web pages .. it is entitled to receive copies of every published book and broadcasted content. Its legal deposit mandate in this area extended beyond books, as it was duty-bound to collect and preserve all of Norway’s cultural heritage .. an agreement with Norwegian newspapers permitted LLM training on copyrighted content.
Husnes said: ”No private company has this.”
So yeah they seem to have proprietary data...
> proprietary data
It is just copyrighted data that is harder to get a hold of. All the copies are available to anyone to use if they just read it. Copyright makes other uses complicated. I wonder if the whole Creative Commons debate was a mistake; you can never fix copyright in a digital world.
Current-best models are pretty fluent at major languages and cultures, so it's untrue at least for the "any" qualifier. Performance is barely affected or might be even better sometimes. However English patterns can subtly leak into native patterns of other languages. It's obviously very different for low-resource languages, but to improve them you need more data, not a new model.
> Current-best models are pretty fluent at major languages and cultures
Strong disagree on that one. As a German interacting with ChatGPT, even in German it gives me the feeling of talking to the Pluribus people, which reminds me of an anecdote of Walmart failing in Germany because people were freaked out by the constantly upbeat, smiling employees.
Understanding a culture is a very different task than translating the syntax of a text, and these systems might be capable of syntactic fluency but they do not really understand culture. You have to metaphorically abuse these models until they stop sounding like the crossover of an HR department person and a Mormon missionary.
I'm Finnish and dear god I hate the default overly friendly tones of LLMs. Always the first thing to tune in the system prompt.
You're a machine, stop anthropomorphizing yourself and pretending to be my best friend, and just give me the damn answer and nothing else. :D
Set the personality to 'Robot', it makes the interactions so much more tolerable.
Maybe it can at least write like a Norwegian instead of just English-translated-into-Norwegian. It would be interesting to see if they try something like the experiments in https://arxiv.org/pdf/2507.22445 on it.
As the article explains, Norway's National Library has a database of practically everything published and broadcast in Norwegian going back many decades. From the way the dataset is described in the article, it does not sound like OpenAI et al. would have easy access to it in its entirety.
Quite true?
English is ludicrously overabundant in training when compared to any other language.
And that's probably necessary if you want a competent model. There simply isn't much Norwegian literature on, let's say, banana farming.
It's probably just an excuse to play with LLMs using big government funding :)
yeah and alignment is all about how to be less evil, which is no easy job... I can just imagine a Chinese LLM rendering 1989 Tiananmen Square as an incident orchestrated by the CIA which the CCP successfully thwarted etc etc