ETH Zurich and EPFL to release a LLM developed on public infrastructure

1 day ago (ethz.ch)

I hope they do well. AFAIK they’re training or finetuning an older LLaMA model, so performance might lag behind SOTA. But what really matters is that ETH and EPFL get hands-on experience training at scale. From what I’ve heard, the new AI cluster still has teething problems. A lot of people underestimate how tough it is to train models at this scale, especially on your own infra.

Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.

  • No, the model has nothing to do with Llama. We are using our own architecture, and training from scratch. Llama also does not have open training data, and is non-compliant, in contrast to this model.

    Source: I'm part of the training team

  • Imo, a lot of the magic is also dataset driven, specifically the SFT and other fine tuning / RLHF data they have. That's what has separated the models people actually use from the also-rans.

    I agree with everything you say about getting the experience, the infrastructure is very important and is probably the most critical part of a sovereign LLM supply chain. I would hope there will also be enough focus on the data, early on, that the model will be useful.

  • When I read "from scratch", I assume they are doing pre-training, not just finetuning; do you have a different take? Do you mean it's the normal Llama architecture they're using? I'm curious about the benchmarks!

  • The infra does become pretty complex to get a SOTA LLM trained. People assume it's as simple as loading up the architecture and a dataset + using something like Ray. There's a lot that goes into designing the dataset, the eval pipelines, the training approach, maximizing the use of your hardware, dealing with cross-node latency, recovering from errors (see the sketch below for that last part), etc.

    But it's good to have more and more players in this space.
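
    To make the error-recovery point concrete, here's a minimal sketch of atomic checkpoint-and-resume in a generic PyTorch-style loop (hypothetical paths and intervals, not the actual ETH/EPFL setup):

      import os
      import torch

      CKPT = "checkpoints/latest.pt"  # hypothetical location on shared storage

      def save_checkpoint(model, optimizer, step):
          """Write to a temp file, then rename, so a crash never leaves a half-written checkpoint."""
          tmp = CKPT + ".tmp"
          torch.save({"model": model.state_dict(),
                      "optim": optimizer.state_dict(),
                      "step": step}, tmp)
          os.replace(tmp, CKPT)  # atomic rename on POSIX filesystems

      def load_checkpoint(model, optimizer):
          """Resume from the last good checkpoint, or start at step 0."""
          if not os.path.exists(CKPT):
              return 0
          state = torch.load(CKPT, map_location="cpu")
          model.load_state_dict(state["model"])
          optimizer.load_state_dict(state["optim"])
          return state["step"]

      # In the training loop:
      #   start = load_checkpoint(model, optimizer)
      #   for step in range(start, total_steps):
      #       ... forward / backward / optimizer.step() ...
      #       if step % 1000 == 0:
      #           save_checkpoint(model, optimizer, step)

    At cluster scale you'd layer sharded/async checkpointing and automatic job restarts on top, but resume-from-last-good-state is the core of it.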

  • I'd be more concerned about the size being 70B (DeepSeek R1 has 671B), which makes catching up with SOTA kinda more difficult to begin with.

    • SOTA performance is relative to model size. If it performs better than other models in the 70B range (e.g. Llama 3.3) then it could be quite useful. Not everyone has the VRAM to run the full-fat DeepSeek R1 (rough numbers in the sketch below).

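      Rough weights-only VRAM math (ignores KV cache and activations, so real numbers run higher; the sizes and quantizations are just illustrative):

        # Weights-only estimate: params * bytes_per_param.
        def vram_gb(params_billion: float, bits: int) -> float:
            return params_billion * 1e9 * bits / 8 / 1024**3

        for name, size in [("8B", 8), ("13B", 13), ("70B", 70), ("671B (R1)", 671)]:
            print(f"{name:>10}: fp16 ~{vram_gb(size, 16):5.0f} GB, "
                  f"int8 ~{vram_gb(size, 8):5.0f} GB, 4-bit ~{vram_gb(size, 4):5.0f} GB")

        # 70B at 4-bit is roughly 33 GB (a pair of 24 GB cards, or one big workstation GPU),
        # while 671B still needs hundreds of GB even heavily quantized.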

"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!

  • I wonder if the reason for these results is that any data on the internet is already copied to other locations by actors who ignore crawling opt-outs. So even if they respect all web crawling opt-outs, they still effectively end up with the data, because someone else who didn't respect the opt-out has re-hosted it somewhere without one.

    • Yes this is an interesting question. In our arxiv paper [1] we did study this for news articles, and also removed duplicates of articles (decontamination). We did not observe an impact on the downstream accuracy of the LLM, in the case of news data.

      [1] https://arxiv.org/abs/2504.06219

    • My guess is that it doesn't remove that much of the data, and the post-training data (not just randomly scraped from the web) probably matters more

  • Is there not yet a source where the web has already been scraped and boiled down to just the text? It would seem someone would have created such a thing in order to save LLM training from having to reinvent the wheel.

    I understand the web is a dynamic thing but still it would seem to be useful on some level.

  • No performance degradation on training metrics except for the end user. At the end of the day users and website owners have completely orthogonal interests. Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master.

    • > Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master

      How are you going to serve users if web site owners decide to wall their content? You can't ignore one side of the market.


ETH Zurich is doing so many amazing things that I want to go study there. Unbelievable how many great people are coming from that university

Is this setting the bar for dataset transparency? It seems like a significant step forward. Assuming it works out, that is.

They missed an opportunity though. They should have called their machine the AIps (AI Petaflops Supercomputer).

  • I think that the Allen Institute for Artificial Intelligence OLMo models are also completely open:

    OLMo is fully open

    Ai2 believes in the power of openness to build a future where AI is accessible to all. Open weights alone aren’t enough – true openness requires models to be trained in the open with fully open access to data, models, and code.

    https://allenai.org/olmo

The open training data is a huge differentiator. Is this the first truly open dataset of this scale? Prior efforts like The Pile were valuable, but had limitations. Curious to see how reproducible the training is.

  • > The model will be fully open: source code and weights will be publicly available, and the training data will be transparent and reproducible

    This leads me to believe that the training data won’t be made publicly available in full, but merely be “reproducible”. This might mean that they’ll provide references like a list of URLs of the pages they trained on, but not their contents.

    • That wouldn't seem reproducible if the content at those URLs changes. (Er, unless it was all web.archive.org URLs or something.)


    • Yeah, I suspect you're right. Still, even a list of URLs for a frontier model (assuming it does turn out to be of that level) would be welcome over the current situation.

  • Yup, it’s not a dataset packaged like you hope for here, as it still contains traditionally copyrighted material

The press release talks a lot about how it was done, but very little about how capabilities compare to other open models.

  • It's a university, teaching the 'how it's done' is kind of the point

    • Sure, but usually you teach something that is inherently useful, or can be applied to some sort of useful endeavor. In this case I think it's fair to ask what the collision of two bubbles really achieves, or if it's just a useful teaching model, what it can be applied to.

  • The model will be released in two sizes — 8 billion and 70 billion parameters [...]. The 70B version will rank among the most powerful fully open models worldwide. [...] In late summer, the LLM will be released under the Apache 2.0 License.

    We'll find out in September if it's true?

Any info on context length or comparable performance? The press release is unfortunately lacking in technical details.

Also, I'm curious whether there was any reason to make such a PR without actually releasing the model (due in late summer)? What's the delay? Or rather, what was the motivation for a PR?

I'm disappointed. 8B is too small for GPUs with 16 GB VRAM (still common in affordable PCs), which could easily run most 13B to 16B models, depending on the quantization.

I wonder if multilingual LLMs are better or worse compared to a single-language model.

  • This is an interesting problem with various challenges. Currently most tokenizers are trained with byte pair encoding, where the most commonly seen combinations of characters are merged into tokens. Because the training text is mostly English, the majority of the learned tokens are English, meaning your LLM gets a more efficient tokenization of English than of the other languages it was trained on (toy example below the link).

    C.f. https://medium.com/@biswanai92/understanding-token-fertility...
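
    A toy byte-pair-encoding run to show the effect (pure Python, made-up English-heavy corpus; real tokenizer training is vastly larger, but the bias mechanism is the same):

      from collections import Counter

      def pair_counts(words):
          """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
          pairs = Counter()
          for word, freq in words.items():
              symbols = word.split()
              for a, b in zip(symbols, symbols[1:]):
                  pairs[(a, b)] += freq
          return pairs

      def merge(pair, words):
          """Merge the chosen pair into a single symbol everywhere it occurs."""
          old, new = " ".join(pair), "".join(pair)
          return {w.replace(old, new): f for w, f in words.items()}

      # Word -> frequency; "</w>" marks end of word, as in the original BPE formulation.
      corpus = {"t h e </w>": 50, "t h i s </w>": 30, "t h a t </w>": 25,
                "d i e s e </w>": 3, "d a s </w>": 2}

      for step in range(6):
          counts = pair_counts(corpus)
          best = max(counts, key=counts.get)
          corpus = merge(best, corpus)
          print(f"merge {step + 1}: {best} -> {''.join(best)}")

      # Every merge goes to the frequent English words first, so English ends up with
      # fewer tokens per word (lower "fertility") than the rarer German words.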

Pretty proud to see this at the top of HN as a Swiss (and I know many are lurking here!). These two universities produce world-class founders, researchers, and engineers. Yet, we always stay in the shadow of the US. With our top-tier public infrastructure, education, and political stability (+ neutrality), we have a unique opportunity to build something exceptional in the open LLM space.

  • I think EPFL and ETH are generally well known internationally, but Switzerland being rather small (9M pop), it's only natural you don't hear much about it compared to other larger countries!

The article says

“Open LLMs are increasingly viewed as credible alternatives to commercial systems, most of which are developed behind closed doors in the United States or China”

It is obvious that the companies producing big LLMs today have an incentive to enshittify them: trying to sell subscriptions while also doing product placement, ads, etc. Worse, some already have political biases they promote.

It would be wonderful if a partnership between academia and government in Europe could deliver public-good search and AI that endeavours to serve the user over the company.

  • Yes, but it’s a very complicated service to deliver. Even if they train great models, they likely will not operationalize them for inference. Those will still be private actors, and the incentives to enshittify will be the same. Also, for AI generally the incentive is much stronger than in the last tech generation, due to the cost of running these things. Basically, the free services where you’re the product must aggressively extract value out of you in order to make a profit.

Use case for science and code LLMs: Superhydrodynamic gravity (SQR / SQG)

LLMs do seem to favor general relativity but probably would've favored classical mechanics at the time given the training corpora.

Not-yet unified: Quantum gravity, QFT, "A unified model must: " https://news.ycombinator.com/item?id=44485226

Microsoft has a new datacenter that you don't have to keep adding water to, which spares the aquifers.

How to use this LLM to solve energy and sustainability problems all LLMs exacerbate? Solutions for the Global Goals, hopefully

gross use of public infrastructure

  • Some time ago there was a Tom Scott video about the fastest-accelerating car in the world, developed by a team made up mostly of students. One remark stayed with me: "the goal is not to build a car, but to build engineers".

    In that regard it's absolutely not a waste of public infra, just like that car was not a waste.

  • It even used green power. Literally zero complaints or outcry from the public yet. Guess we like progress, especially if it helps independence.

  • University and research clusters are built to run research code. I can guarantee this project is 10x as impactful and interesting as what usually runs on these machines. This coming from someone in the area that usually hogs these machines (numerical simulation). I'm very excited to see academic actors tackle LLMs.

Why would you announce this without a release? Be honest.

  • The announcement was at the International Open-Source LLM Builders Summit held this week in Switzerland. Is it so strange that they announced what they are doing and the timeline?

  • Funding? Deeply biasing European users toward publicly-developed European LLMs (or at least not American or Chinese ones) would make a lot of sense. (Potentially too much sense for Brussels.)

This seems like the equivalent of a university designing an ICE car...

What does anyone get out of this when we have open-weight models already?

Are they going to do very innovative AI research that companies wouldn't dare try/fund? Seems unlikely...

Is it a moonshot huge project that no single company could fund..? Not that either

If it's just a little fun to train the next generation of LLM researchers, then you might as well just make a small-scale toy instead of using up a supercomputer center.

  • Why do you think it's about money? IMO it's about much more than that, like independence and actual data freedom through reproducible LLMs.

  • This model will be one of the few open models where the training data is also open, which makes it ideal for fine-tuning.

  • That it will actually be open and reproducible?

    Including how it was trained, what data was used, how training data was synthesized, how other models were used etc. All the stuff that is kept secret in case of llama, deepseek etc.

  • Supercomputers are being used daily for much toy-ier codes in research; be glad this at least interests the public and constitutes a foray of academia into new areas.