
Comment by embedding-shape

5 days ago

Yeah, I was about to comment that too. Instead of training a new model and new weights exclusively for Norwegian (and expecting/wanting every other small/medium-sized country to do the same), which seems infinitely harder, they could have made high-quality transcriptions and English translations of the stories currently described only in Norwegian, and made it all public. I guess there would still be a worry that it'd be counted as "less important" compared to history, news and culture about other countries.

Oddly enough, my wife was recently involved in a project to translate historical crime novels from Norwegian; since all the available late 20th century Scandinavian crime novels have already been translated and turned into popular TV series, the plan was to go further back. Into the 1930s. The first cut was done with LLMs, but encountered the problem that (a) Norwegian itself has changed noticeably since then, in both major dialects, and (b) the machine translation deteriorated on large sections, resulting in entirely missing paragraphs and pages in a few places. Not to mention the usual translation issues (what police role does lensman map to?) and localisation (to what extent should the casual antisemitism be left in or removed?)

Translation is never a bijective process. It's never quite the same experience in translation as it is in the original, due to the cultural differences between reader and writer. Larger in this case because 1930s Norway is very different even from 2020s Norway.

Ultimately this was not a success due to marketing difficulties; it is very difficult to get a book noticed.

( https://www.amazon.co.uk/Iron-Chariot-Nordic-Crime-Library/d... )

  • Sorry if I was unclear. I didn't want to give the impression that I think translation, or even transcription in some cases, is easy, free of problems, or anything less than painstakingly time-consuming; it very much is.

    I just think building an LLM from scratch is even harder, with more potential problems that are harder to solve, more time-consuming and even more resource-intensive.

    • It would require an investment, but it will pay dividends later, as it becomes easier to train LLMs on/for Norwegian. If we need to translate everything to English, we might as well just drop using Norwegian altogether. Practically everyone speaks English fluently already...


  • > in both major dialects

    Nynorsk and bokmål are not dialects but variants of written Norwegian.

> high quality transcriptions and translations of the stories currently described only in Norwegian into English

You make it sound like an easier task than training an LLM. I'd argue it's not obvious, and would assume the contrary.

  • Yes, why wouldn't it be easier to transcribe and translate, skills humanity had for centuries, compared to LLMs that we've only learnt to build these last few years, and even require a frikken computer to do? Of course one of these is harder than the other...

    • Look at it from this lens: translating and transcribing these stories hasn't happened in the centuries they have existed, while, as you point out, the skills were always there. In contrast, LLMs have been here for a few years at most and everyone and their dog is trying to get in on the "race".

      With absolutely no insight into why, which one has better odds to happen first is obvious to me.


Copyrights and statutes don't allow them to do that. The mandate of the National Library maybe permits them to make an LLM, though (though I won't be at all surprised if someone sues them anyway).