Comment by WatchDog
5 days ago
If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available? Why go to the expense of training your own model, especially when it will be inferior to state-of-the-art models?
I task GPT/Claude on a daily basis with researching very specific cultural or legal aspects of French politics. Even though French is a way more common language globally than Norwegian, these models still haven't figured out that, no matter the language I myself speak to them (German or English depending on my mood), their web searches need to be done in French to return reasonable results. I have to remind them every time lest they come back with "uh, didn't find anything relevant, here take some hallucinations instead."
So, given the Anglo-centrism of current models, my confidence in American providers giving any shits about non-American users/use-cases is pretty low. And lower the smaller the language community is.
I've noticed that it also imposes American moral judgements on certain things, even though it reasons (sometimes) in the native language.
I was trying to work out how and when to use swear words, and how strong they are relative to each other. It translated English swear words into the target language, then lectured me on not using them.
It took a bunch of prodding for it to actually think in the target language and then give the (mostly) correct response.
Would be curious about the model and the prompt for this.
Not kidding at all. I had a similar issue with a project where I needed to classify images into specific demographics, and Gemini, while capable, was entirely not going to do the task… until in my JSON response I left room for it to tell me why this was not a good idea and why it was culturally insensitive. Then boom… full JSON array: hair color, eye color, skin color, fitness level, likely ethnicity, likely country of origin, and about 10 other values.
You’re probably wondering what on earth I was working on. I was matching AI-generated headshots to AI voices so that in an app the voice picker had human (AI) faces.
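For anyone curious, the "leave room in the JSON for objections" trick described above could look roughly like this. It is only a sketch under assumptions: the field names, the `concerns` slot, and the helper functions are hypothetical, not the poster's actual schema or any provider's API, and no model call is made here.

```python
import json

# Hypothetical template: field names are illustrative, not the poster's real schema.
RESPONSE_TEMPLATE = {
    "concerns": "",    # free-text slot where the model can voice objections
    "hair_color": "",
    "eye_color": "",
}


def build_prompt(task: str) -> str:
    """Embed the JSON template in the prompt so the objection slot is part of the expected output."""
    return (
        f"{task}\n"
        "Reply with JSON only, matching this template. "
        "Use the 'concerns' field for any reservations about the task.\n"
        f"{json.dumps(RESPONSE_TEMPLATE, indent=2)}"
    )


def split_reply(raw: str) -> tuple[str, dict]:
    """Separate the model's objections from the attribute values it returned."""
    data = json.loads(raw)
    concerns = data.pop("concerns", "")
    return concerns, data
```

The prompt from `build_prompt` would go to whatever model you use, and `split_reply` would take its raw JSON reply; whether a given model actually cooperates once it has that slot is anecdotal.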
Aren’t you already using English in the LLM convo? Telling the model to use French for research or to find resources in French seems like a reasonable step.
If you’re doing this on a daily basis, then you should have an AGENTS.md that accumulates directional instructions like this.
This is how you use the tool correctly.
There’s this weird pattern I’ve noticed where people expect LLMs to require zero effort or proficiency on their part, and when the LLM isn’t perfect without that effort (of course it isn’t), they conclude that LLMs suck.
The issue is that French, Italian, African, or Japanese people shouldn't have the inconvenience of instructing the LLM tool to get the basic facts about their own culture. They should be able to use an LLM that has already been trained like that by default. Nobody has an obligation to use a tool that thinks it is talking to an American. If I go to Google, for example, I want to get facts about my own country in my own language.
> Aren’t you already using English in the LLM convo? Telling the model to use French for research or to find resources in French seems like a reasonable step.
Most ordinary people will just use their native language and they have no way of knowing that the model always reasons in English and therefore is strongly biased toward using English search terms. So they don't know they have to remind the model to search in their local language.
If you ask in French, it searches in French, right?
I have the opposite problem, where I'll ask in English, about something in a foreign country, the results it finds will all be in that foreign language, and the LLM will switch languages and respond in that language (which I don't speak).
So then I have to ask it "can you repeat that in English please."
I keep waiting for the new GPT-Definitely-AGI-For-Real-This-Time to fix it, but it's still there.
> If you ask in French, it searches in French, right?
not necessarily. i often prompt Claude in German and then see the reasoning happening in English. of course it will eventually reply in German, but that does not mean that the tooling in the background was using German.
Same for me - I mostly ask stuff in English but sometimes add specific terms or names in Japanese as needed. My Japanese is intermediate, but it will often switch immediately and reply only and entirely in Japanese. I'm pretty sure they have a system prompt with hairline triggers for foreign languages BECAUSE of the overrepresentation of English in the training corpora.
> their web searches need to be done in French to return reasonable results.
I wonder how much of this is also just the search engine's region setting.
It's a big problem I regularly have with Google. I almost always want English language, US-centric results, so I have my region set to the US. But occasionally I want results relevant to my actual country, and even searching in my native language usually yields much worse results than just opening an incognito tab and letting it default to my real location.
I gave up on Google's language and region settings a long time ago, years before giving up on Google as a product.
To this day they still think I'm in Sweden sometimes, in Paris other times, or in Germany, while I haven't lived in any of those places for years.
Have you tried asking it to translate the prompt to French, and then feeding it the translated prompt?
I have the opposite problem. I often have to ask ChatGPT about things related to Norway and I have to constantly correct it when it keeps switching to responding in Norwegian no matter how many times I tell it to only answer in Norwegian when I request it.
What incentives does OpenAI have to make sure the AI actually works well with Norwegian beyond capturing a (small) Norwegian market? What incentives do they have to take Norwegian values into consideration, or to preserve Norwegian culture into the future? The matter is also a question of national sovereignty, so to simply release the data and nicely ask foreign companies to solve the problem for you would be a fool's move.
It's also a bit funny because Norway definitely has enough money to hire a team of Anthropic's best to go out there and train them a model that does whatever they want. They probably have enough money to fund their own Anthropic competitor.
I highly doubt that hiring people who don't even speak the language would result in a better model for Norwegian. If anything, they could pay Anthropic for some tips and tricks for training. But that does not seem necessary, as DeepSeek & co detail everything for free.
>They probably have enough money to fund their own Anthropic competitor.
It's bizarre to me that Norway doesn't have a booming tech sector, with all that wealth fund acting as the biggest VC.
They instead use their wealth fund to invest in the US tech sector. Baffling.
Yeah, was about to comment that too. Instead of training a new model and new weights exclusively for Norwegian (and expecting/wanting every other small/medium-sized country to do the same), which seems infinitely harder, they could have made high quality transcriptions and translations of the stories currently described only in Norwegian into English, and made it all public. I guess there still would be a worry that it'd be counted as "less important" compared to other history, news and culture about other countries.
Oddly enough, my wife was recently involved in a project to translate historical crime novels from Norwegian; since all the available late 20th century Scandinavian crime novels have already been translated and turned into popular TV series, the plan was to go further back. Into the 1930s. The first cut was done with LLMs, but encountered the problem that (a) Norwegian itself has changed noticeably since then, in both major dialects, and (b) the machine translation deteriorated on large sections, resulting in entirely missing paragraphs and pages in a few places. Not to mention the usual translation issues (what police role does lensman map to?) and localisation (to what extent should the casual antisemitism be left in or removed?)
Translation is never a bijective process. It's never quite the same experience in translation as it is in the original, due to the cultural differences between reader and writer. Larger in this case because 1930s Norway is very different even from 2020s Norway.
Ultimately this was not a success due to marketing difficulties; it is very difficult to get a book noticed.
( https://www.amazon.co.uk/Iron-Chariot-Nordic-Crime-Library/d... )
Sorry if I was unclear. I didn't want to give the impression that I think translation, or even transcription in some cases, is easy or free of problems. It is very much painstaking and time-consuming.
I just think building an LLM from scratch is even harder, with more potential problems that are harder to solve, more time-consuming and even more resource-intensive.
> in both major dialects
Nynorsk and Bokmål are not dialects but variants of written Norwegian.
> high quality transcriptions and translations of the stories currently described only in Norwegian into English
You make it sound like an easier task than training an LLM. I'd argue it's not obvious, and would assume the contrary.
Yes, why wouldn't it be easier to transcribe and translate, skills humanity has had for centuries, compared to building LLMs, which we've only learnt to do these last few years, and which even require a frikken computer? Of course one of these is harder than the other...
Copyrights and statutes don't allow them to do that. The mandate of the National Library maybe permits them to make an LLM, though (though I won't at all be surprised if someone sues them anyway).
absolutely. somebody online was wanting an LLM with Georgian language support, and that's exactly what i suggested: start digitizing Georgian text.
> wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available?
Only if you believe other people will value that enough to expend the effort necessary to use it. If you believe other people will see it as low value and ignore it then you'd be better off doing the training yourself in order to guarantee it happens.
There's also a secondary benefit that your team doing the work will learn some useful skills while they do it.
Because state of the art models are owned and controlled by foreign agents.
Because you have so much money you don’t know what to do with it any more.
Permissions, probably. Copyrights and statutes. Knowing the librarians, unfortunately the prestige of their job is more vested in denying you access than giving you access.
I mean it's their job to give people access to information, and they certainly do, but the mark of a professional, in their eyes, is guarding information. It's much more embarrassing for them professionally to give too much access than too little.
LLM training gives them a "respectable" way of bypassing that and giving the world their information (which, in fairness, they probably all really want to do if they could).
If they wanted to, they could: they all have scanners and access to information on how to create torrents. Setting the information free isn't complicated, so it'd seem most of them do not want to.
Where do you seed a 60 petabyte torrent? I'm sure some choice cuts of what individuals feel is important have made it to Anna's, but I don't think refusal to go on a full data liberation spree is evidence they don't care.
> Why go to the expense of training your own model, especially when it will be inferior to state of the art models.
Uuh.. No? Especially if the training data, as in this case, is of better quality.
> Why go to the expense...
Answer: idiocy of decision makers and the desire to get resources by those who created the proposal.
I assumed Scandinavia had better decision-making processes, but apparently I was wrong.