Comment by mrshu
2 days ago
It is worth noting that there is _another_, completely unrelated, EU-funded project called *EuroLLM*, which not only shares many of the same goals but has already fulfilled many of them:
1. large multilingual dataset
2. open science approach
3. competitive performance
Here is the HF blogpost that introduced it in December 2024 (along with various benchmarks): https://huggingface.co/blog/eurollm-team/eurollm-9b
The project's lead has summarized the situation succinctly in a LinkedIn post [0].
I hope the different communities collaborate openly, share their expertise, and don't decide to reinvent the wheel every time a new project gets funded. What next? "OpenEuroLLM with real cheese"?
[0] https://www.linkedin.com/posts/andre-martins-31476745_ai-art...
Homepage: https://sites.google.com/view/eurollm
Deliverables:
- A series of models of different sizes for optimal effectiveness and efficiency (1B, 9B and 22B) trained on 4T tokens
- A multimodal model which can process and understand speech or text input
- Full project codebase available to the public with detailed data and model descriptions
I can't find the codebase yet, though.
The results don't seem that bad for a 9B model: https://huggingface.co/blog/eurollm-team/eurollm-9b
I've been running it with Ollama, and it's actually pretty good for working with text in Latvian (and other EU languages). I'd be hard pressed to find another model of a similar size that's good at it, for example: https://huggingface.co/spaces/openGPT-X/european-llm-leaderb...
This won't be relevant to most people here, but it's cool to see even the smaller languages getting some love. The alternative is garbage output from Qwen (some versions of which are otherwise pretty good for programming) and from anything below Llama 70B, or maybe settling for Gemma as a middle ground.
"...EuroLLM-9B was trained on approximately 4 trillion tokens, using 400 Nvidia H100 GPUs on the MareNostrum5 supercomputer..."
Thanks for the heads up, I missed this project! However, their page says "Project Timeline: 1 May 2024 - 30 April 2025". April isn't far away. Does anyone know what's supposed to happen afterwards?
That timeline is just for the preliminary hearing on potential committee members.
No sarcasm, sorry.