
Comment by moffkalast

19 days ago

> The models will be developed within Europe's robust regulatory framework, ensuring alignment with European values while maintaining technological excellence

As a European, that's practically an oxymoron. The more one limits oneself to legally clean data, the worse the models will be.

I hate to be pessimistic from the get-go, but it doesn't sound like anything useful will be produced by this, and we'll have to keep relying on Google to do proper multilinguality in open models because Mistral can't be arsed to bother beyond French and German.

I've been using Mistral this past week due to changes in geopolitics, and it works absolutely great in English. I haven't tried it in my native language yet, but in English it has been better than my first experience with ChatGPT (GPT-3.5), actually.

  • Update: tried a couple of Dutch (my native language) queries, and it worked well. No issues whatsoever. Which is no surprise, given Dutch <-> English translations often work very well.

  • Ok I see we're very far from being on the same page.

    Multilingualism in the context of language models means something more than English, because English is what every model trained on the internet already knows. There aren't any I'm aware of that don't know it, since it would be exceedingly hard to exclude English from the dataset even if you wanted to for some reason. This is like the "what about men's rights" response when talking about women's rights... yes, we know, models that know English are already entirely ubiquitous.

    But more properly, I would consider LLM multilingualism to mean straight up knowing all languages. We benchmark models on MMLU and similar collections that cover all fields of knowledge known to man, so I would say it's reasonable to expect fluency in all languages as well.

  • I've been using Mistral for most of January at the same rate as ChatGPT before. I decided to pay for it as it's per token (in and out), and the bill came yesterday... A whopping 1 cent. That's probably rounded up.

    • > I decided to pay for it as it's per token (in and out), and the bill came yesterday... A whopping 1 cent.

      Doesn't sound too good wrt their eventual profitability. (Rough per-token arithmetic sketch below.)
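For a sense of scale, here is a back-of-the-envelope sketch of per-token billing. The rates and usage figures below are made-up placeholders rather than Mistral's actual published pricing; the point is only that light chat usage adds up to well under a million tokens a month, which costs pennies at typical per-million-token rates.

```python
# Back-of-the-envelope sketch of per-token API billing. The rates and usage
# below are made-up placeholders, NOT Mistral's actual published pricing.

RATE_IN_PER_MTOK = 0.10   # assumed dollars per million input tokens (hypothetical)
RATE_OUT_PER_MTOK = 0.30  # assumed dollars per million output tokens (hypothetical)

def monthly_cost(queries_per_day: int, tokens_in: int, tokens_out: int, days: int = 30) -> float:
    """Estimated monthly cost in dollars for a simple chat workload."""
    total_in = queries_per_day * days * tokens_in
    total_out = queries_per_day * days * tokens_out
    return total_in / 1e6 * RATE_IN_PER_MTOK + total_out / 1e6 * RATE_OUT_PER_MTOK

# A casual user: 10 chats a day, ~300 tokens in and ~500 tokens out per chat.
print(f"${monthly_cost(10, 300, 500):.4f}")  # ≈ $0.054 per month at these made-up rates
```

At rates anywhere in this ballpark, a month of casual use really does round to pennies.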

> As a European, that's practically an oxymoron. The more one limits oneself to legally clean data, the worse the models will be.

Train an LLM on textbooks and other legally clean books; you do not need to train it on pop culture to make it intelligent.

For face generation you might need to be more creative, but you should not need millions of images stolen from social media to train your model.

But it makes sense that tech giants do not want to share their datasets or be transparent about this stuff.

  • > Train an LLM on textbooks and other legally clean books

    Without licenses to the books, they are just as illegal as (and maybe even more so than) web content.

    • > Without licenses to the books, they are just as illegal as (and maybe even more so than) web content.

      There are books that are out of copyright, and also freely licensed ones.

    • If LLM organizations are free to throw billions at hardware, they can spare a paltry €50 million for 10 million e-books (that's €5 a book) though, right?


> Mistral can't be arsed to bother beyond French and German.

Any more details here or a writeup you can link to?

  • My own experience, mainly; only Gemma seems to have been any good for Slavic languages so far, and only the 27B, unquantized, is reliable enough to be in any way usable.

    Ravenwolf posts tests on his German benchmarks every so often in LocalLLaMA, and most models seem to do well enough, but I've heard claims from people that Mistral's models are their favorites in German anyhow. And I think Mistral-Large scores higher than Llama-405B in French on lmsys, which is at least something one would expect from a French company.

    • In my experience Mistral (at least Nemo) works well with other languages. Don't know about Slavic languages but it does Romanian, with apparent issues around the translation of technical terms.

What do you mean by relying on Google?

The largest Llama 3.1 and DeepSeek V3/R1 models are rather good at even a niche language like Finnish. Performance does plummet in the smaller versions, and even quantization may harm multilinguality disproportionately.

Something like deliberately distilling specific languages from the largest models could work well. Starting from scratch with a "legal" dataset will most likely fail, as you say.
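As a rough illustration of that distillation route, here is a minimal sketch that samples Finnish text from a large open "teacher" model and stores it as synthetic training data for a smaller, language-focused student. The teacher model id, prompts, and output format are placeholder assumptions, and a 405B-class teacher would realistically be served on a cluster rather than loaded like this.

```python
# Sketch: sample Finnish text from a large multilingual "teacher" model and save
# it as synthetic data for fine-tuning a smaller, language-focused student.
# The model id, prompts, and file format are illustrative placeholders.

import json
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "meta-llama/Llama-3.1-405B-Instruct"  # hypothetical teacher; far too big to load casually

tokenizer = AutoTokenizer.from_pretrained(TEACHER)
model = AutoModelForCausalLM.from_pretrained(TEACHER, device_map="auto")

# Finnish prompts (roughly: "Write a short news article about a Finnish tech company",
# "Explain simply how a solar panel works").
prompts = [
    "Kirjoita lyhyt uutisartikkeli suomalaisesta teknologiayrityksestä.",
    "Selitä yksinkertaisesti, miten aurinkopaneeli toimii.",
]

with open("synthetic_finnish.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.8)
        completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        # Each line becomes one synthetic document for the student model's training set.
        f.write(json.dumps({"prompt": prompt, "completion": completion}, ensure_ascii=False) + "\n")
```

The student would then be fine-tuned on synthetic_finnish.jsonl alongside whatever native corpus exists; this sidesteps the tiny-corpus problem, though not the licensing questions around the teacher's own training data.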

Silo AI (co-lead of this model) already tried Finnish and Scandinavian/Nordic models with the from-scratch strategy, and the results are not too encouraging.

https://huggingface.co/LumiOpen

  • Yes, I think small languages with a total corpus of maybe a few hundred million tokens have no chance of producing a coherent model without synthetic data. And using synthetic data from existing models trained on all public (and less public) data is enough of a legal gray area that I wouldn't expect this project to consider it, so it's doomed before it even starts.

    Something like 4o is so perfect in most languages that one could just make an infinite dataset from it and be done with it (a toy sketch of the mechanics follows below). I'm not sure how OAI managed it, tbh.
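A toy sketch of that "infinite dataset" mechanic, assuming the OpenAI Python client: loop over topics, ask for documents in the target language, and append them to a corpus file. The model name, topics, language, and file layout are arbitrary illustrations, and whether such output may legally be used to train a competing model is exactly the gray area raised above.

```python
# Toy sketch of sampling a synthetic corpus in a target language from an API
# model. Topics, file layout, and language are arbitrary illustrations.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_document(topic: str, language: str = "Finnish") -> str:
    """Ask the model for one natural-sounding document in the target language."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Write a detailed, natural-sounding article in {language} about {topic}.",
        }],
    )
    return response.choices[0].message.content

with open("synthetic_corpus.txt", "a", encoding="utf-8") as f:
    for topic in ["public transport", "home cooking", "local history"]:
        f.write(sample_document(topic) + "\n\n")
```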