Comment by palisade

6 days ago

I've been contemplating a decentralized model training system for some time using volunteer machines that we all contribute. But, it is astronomically difficult. The communication speeds are untenable.

And, there is the issue of data poisoning from untrusted nodes. I've almost cracked that last issue with a self-healing checkpointed rollback system that doesn't have to throw out anything that follows the corrupt datum.

But, I'm just one person with an idea and I don't have infinite funds to make this happen. This isn't a small project.

Maybe there would be interest in something like this, now that entire frontier labs are being banned from making further progress.

The total power of all GPUs on the planet dwarfs their capabilities, if we had a way to harness them in a distributed way efficiently. We wouldn't be able to train a Fable as fast as them, but eventually having access is better than never having access.

sho 6 days ago

As I replied to a child comment - this is a nice idea that just isn't tenable in reality. AI hardware isn't just hilariously faster than consumer GPUs, it's also hilariously more power-efficient and has hilariously better connectivity. Every one of these dimensions kills the idea.

The far, FAR superior power efficiency means that even if you did harness every public GPU or GPU-like device on earth, you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.

And even if electricity were free, having those GPUs spread over the world with internet-level latency will slow everything down by factors of thousands to millions - if it's feasible at all. Regardless, you're not getting fable-oss this decade, maybe even not this century.

It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.

ux266478 6 days ago
AI hardware is for inference, not training. Training uses normal HPC crap. Superpods aren't really power efficient, it's kind of a meme, and it stems from limiting the power draw of other components by having fewer of them. It's more of a rounding error.
> you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.
Costs spread over a large population, it really doesn't matter. You're not getting hundreds of thousands of people to pitch half their monthly electric bill to pay for someone else's datacenter. They will pay the electricity themselves quite happily though, if all they need to do is give you compute. This isn't new.
Interconnect is the bottleneck for distributed training, nothing else really.
- rurban 6 days ago
  
  You got it wrong. Inference can use crap GPUs. Training needs the 100x more expensive big guns. Our training machine is 100x more expensive than our inference machine.
  
- sho 6 days ago
  
  > AI hardware is for inference, not training
  Not sure what you are referring to, unless you don't think H100/H200/B200 are "AI hardware".
  > Superpods aren't really power efficient
  Maybe not compared to a specialized rig with multiple 4090s, but that is the best case for consumer hardware - the vast majority will be dramatically less efficient than that
  Anyway, I agree the interconnect is by far the biggest obstacle and seems insurmountable, I should probably have led with that.
  
- pksebben 6 days ago
  
  Bit of a doozy though, that one.
  I recall getting really excited over Hinton's Forward-Forward (FF) foray, right before he bailed on AI as a societal direction (which, if anyone ever had the right, I suppose he does). If one squints, one can see a backprop-free base being much easier to train on geographically distributed and heterogeneous hardware.
- dyauspitr 6 days ago
  
  That makes no sense. It’s basically the same calculations for training as well.
- Davidzheng 6 days ago
  
  Are you sure most of frontier cost isn't inference in RL environments?
incrudible 6 days ago

> As I replied to a child comment - this is a nice idea that just isn't tenable in reality. AI hardware isn't just hilariously faster than consumer GPUs, it's also hilariously more power-efficient and has hilariously better connectivity. Every one of these dimensions kills the idea.
The first part is not really true though: the chips are not that much faster, the DRAM is not that much faster, and in aggregate it does not matter because there is just so much more consumer hardware out there (although perhaps that is changing as supply shifts toward datacenters).
The interconnect and data locality are the problem. If you could train it the way you can render a scene with Monte Carlo ray tracing, any result from any node could be merged with any other and the combined result would have converged closer to the limit. I am sure research in that direction exists, it just has not proven effective at the scales where it has been attempted.
WithinReason 6 days ago
Efficiency difference between training on GPUs and TPUs is 2x at best. You can get very efficient with tensor cores, converging to TPU efficiency. In the end math is math, you can't make a multiplication more efficient than it already is on GPU.
- schobi 6 days ago
  
  I guess this was more related to syncing GPUs.
  If you were to take 500 computers with older 1080 GPUs, you might have enough compute/ram equivalent to an H200 GPU for training such a model. Maybe take 10000.
  But if those machines are spread over 10000 homes, wired with residential internet service, training a large model will not get anywhere.
  You go from "data in the same HBM memory chip" at 4.8 TB/s or "data in adjacent GPU" with NVLink at 1.2 TB/s down to 25 Mbit/s upload speed. Accessing the next piece of data is going to be about a million times slower. At the same time you will heat a thousand times more, for a million times longer.
  
- zozbot234 6 days ago
  
  The power-constrained part of compute is data movement, not the elementary arithmetic per se. Anyway, it's very possible to tweak the underlying design to increase throughput a lot for any given power budget at the cost of high latency. This seems especially useful for training workloads where we don't really care about latency as much.
- GeoAtreides 6 days ago
  
  Math is math, but sadly math isn't physics nor engineering.
  
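
The bandwidth comparison in schobi's reply above (4.8 TB/s HBM, 1.2 TB/s NVLink, 25 Mbit/s home upload) can be checked with a few lines of arithmetic. This is only a rough sketch: the rates are the ones quoted in that comment, decimal units are assumed, and the helper name `bits_per_second` is made up for illustration.

```python
# Back-of-the-envelope check of the quoted bandwidth figures.
# The rates come from the comment; bits_per_second is a hypothetical helper.

BITS_PER_BYTE = 8

def bits_per_second(value, unit):
    """Convert a rate given in TB/s or Mbit/s to bits per second (decimal units)."""
    factors = {
        "TB/s": 1e12 * BITS_PER_BYTE,
        "Mbit/s": 1e6,
    }
    return value * factors[unit]

hbm = bits_per_second(4.8, "TB/s")
nvlink = bits_per_second(1.2, "TB/s")
home_upload = bits_per_second(25, "Mbit/s")

hbm_ratio = hbm / home_upload
nvlink_ratio = nvlink / home_upload

print(f"HBM vs home upload:    {hbm_ratio:,.0f}x")
print(f"NVLink vs home upload: {nvlink_ratio:,.0f}x")
```

Under those assumptions the HBM figure comes out at roughly 1.5 million times the home upload rate and the NVLink figure at a few hundred thousand times, so "about a million times slower" is the right order of magnitude.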
c7b 6 days ago
Could you put some numbers and examples behind the efficiency gap between data center and consumer-grade AI hardware? Did you include examples like the RTX Spark on the consumer side? I was always amazed at the low power consumption of unified memory style architectures. In absolute terms and even more so compared to consumer-grade GPUs. I'd be genuinely interested in a comparison with data-center-grade hardware.
- zozbot234 6 days ago
  
  DGX Spark is effectively prosumer hardware, better than most consumer stuff but still not comparable to actual datacenter gear. You can't just look at TDP in isolation without also comparing performance.
- aspenmartin 6 days ago
  
  It's more than the raw hardware, it's the interconnect and communication between the hardware at scale. These models are trained on hundreds of thousands of GPUs today. You _will_ start to see cross-datacenter training runs but this needs to efficiently decide when and how to communicate across datacenters, which bears a very high cost compared to intra-datacenter communication.
CuriouslyC 6 days ago
> It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.
100% agree. The US government basically has to nationalize AI and capture an outsize portion of the revenue from it in order to fix the economy, as the combination of debt burden and interest rate pressure from de-dollarization/global realignment is going to push us into a death spiral, and even if AI is a smash hit, the ~19% federal capture of corporate revenue isn't nearly enough to pull us out of it. Having the people own the compute infrastructure and capture more profit from AI at that layer is the safest, cleanest way to increase revenue capture. A sovereign wealth fund is a mediocre idea because it's possible to play shell games with stocks and redirect profit/debt (venture capital is quite good at this!).
- root-parent 6 days ago
  
  >> The US government basically has to nationalize AI and capture an outsize portion of the revenue from it
  Currently AI has generated no profit. And as it sits, it is a non-viable business.
  I refuse to include the sellers of shovels as AI revenue.
  If the companies buying the shovels are still losing money, then the tool suppliers' fortunes have nothing to do with the economics of the AI application layer, which is losing money on every prompt.
  
- aspenmartin 6 days ago
  
  > The US government basically has to nationalize AI and capture an outsize portion of the revenue from it in order to fix the economy, as the combination of debt burden and interest rate pressure from de-dollarization/global realignment is going to push us into a death spiral, and even if AI is a smash hit, the ~19% federal capture of corporate revenue isn't nearly enough to pull us out of it.
  Any actual numbers to back this up? I don't see how nationalizing a very cutting edge technology outside of wartime is going to go super well. The leverage that these companies have is the same leverage that TSMC has: you can't just take over and expect things to rocket at the pace it's going.
- AtlasBarfed 6 days ago
  
  Like a system of heavily funded institutions dedicated to higher learning?
iugtmkbdfil834 6 days ago
Dunno, in a sense, torrents came about under similar restrictions. Everything at consumer level was just plain awful and at dial-up level, mebbe ISDN if you were very lucky, with fiber only available to ridiculously rich people and corps. But with restrictions came approaches to mitigate them.
- aspenmartin 6 days ago
  
  Yes but not violations of the laws of physics. You need extremely fast communications, memory bandwidth, etc; you cannot get that with distributed training. You're up against the speed of light and the interconnect that powers the internet. You will always have horrifically high latency compared to packing the servers together in the same place with specialized networking.
  
- boutell 6 days ago
  
  If weights can't be looked at almost instantly in bulk, it just doesn't work. It's a different problem from distributing file downloads.
  
herewulf 6 days ago

WRT government data centers, there is certainly precedent for independent researchers getting HPC time on systems owned by US national labs, research institutions, universities, and then publishing their results as part of the public good.
One would question why this hasn't already happened as the rule, as opposed to the proliferation of private data centers. However, I am sure the answers are plain and perhaps saddening to us all.
Cider9986 6 days ago
What makes you think DeepSeek or GLM won't catch up to Fable level? Why would there be a break in the trend now?
- zozbot234 6 days ago
  
  DeepSeek and GLM (plus Kimi) are at or above Sonnet level wrt. favorable workloads like coding. They're not close to Opus or the latest GPT yet, and Fable is even higher than that. Other workloads relying more on real-world knowledge have them even further behind, and this can't be mitigated without making the model itself bigger and harder to host locally.
  
- metalspot 6 days ago
  
  The key thing here is that effective intelligence = model capability / cost. If you drive down the cost of inference you can have higher effective capability even with a technically less capable model. There is nothing in Anthropic's/OpenAI's general reasoning capabilities that can't be easily done much better with a purpose-built harness for a domain-specific task.
- kuboble 6 days ago
  
  I think there are at least a few question marks.
  One being that extrapolating from like 3 data points is hardly science. All trends break at some point.
  The other is that the measures to prevent distillation of their models (if it was a secret sauce of Chinese models) could work if nobody is allowed to use them.
KaiserPro 6 days ago

> It would be better for governments to buy and own their own datacenters,
I mean that's good, but they'd also have to build their own dataset. Which involves either paying people, or breaking the law.
Plus if they do manage to make it work, they will not get any tax revenue from it, as it'll remove the need for labour, which is where a huge amount of tax revenue comes from.
It's a deeply hard problem with lots of second/third order effects.

trenchgun 6 days ago

>But when people think of decentralized training, they don’t first think of gigantic datacenters, owned by the same company, training models across large distances. Instead, they imagine thousands of small datacenters, or individual consumers, pooling their spare compute over the internet to orchestrate a training run larger than any single actor could manage alone. Many companies are pursuing this vision: Pluralis Research, Prime Intellect and Nous Research have already successfully decentrally trained models at scale. But in practice, training decentrally over the internet has lagged far behind more centralized training. Even their largest models (Pluralis’ 8B Protocol Model, Prime Intellect’s INTELLECT-1, and Nous’ Consilience 40B) have been trained with 1,000x less compute than today’s frontier models (such as xAI’s Grok 4). https://epoch.ai/gradient-updates/how-far-can-decentralized-...

killerstorm 6 days ago

I think it's fundamentally not useful as long as there are other open source model releases. E.g. suppose you make SotA model at a particular size via decentralized training. Amazing. In a month Qwen/Deepseek/etc release a new model which is better. So why would you use the "decentralized one"?
Models have limited shelf life while things are improving rapidly, and decentralized training is just more wasteful.
However, things might change if we get to what Karpathy calls "cognitive core" - a stable model backbone which can be extended via skills/adapters/etc. Development of extensions to the core can be a lot more decentralized.
But for now these decentralized training attempts function largely as a deterrent to anti-open-source collusion.

girvo 6 days ago

> The total power of all GPUs on the planet dwarfs their capabilities

That just isn't true. It misunderstands exactly how much silicon has gone directly to those companies, and exactly how much more powerful said silicon is compared to consumer grade gear.

sho 6 days ago
If folding@home is a useful yardstick by which we might estimate the amount of GPU-ish capability that civilians might be coaxed into donating to a shared enterprise, yeah, it doesn't look pretty. This is extremely rough napkin math, but comparing to xAI's Colossus 2 for example, for training workflows you're probably looking at 4-5 orders of magnitude more capability than all of folding@home combined. That's up to 100,000 times faster.
Very rough math like I said but I doubt it's directionally wrong.
And even if you did force literally everyone on earth with some sort of GPU to max it out 24/7 in service of an open source AI training enterprise - you would waste so much power trying to use that inefficient consumer hardware with the worst latency imaginable that it would be cheaper and faster to get everyone to chip in some cash to buy a datacenter with Blackwell chips instead! So the idea has no legs whatsoever.
- haritha-j 6 days ago
  
  Plus a scientific project to benefit all of humanity doesn’t have quite the same ring as the thing that's stealing your job, from the volunteer’s perspective.
- WithinReason 6 days ago
  
  folding@home reached 2.43 exaflops by April 12, 2020, which would make it the largest supercomputer on the planet.
  

cpdomina 6 days ago

there was a project trying to achieve some of those goals a few years ago using p2p: petals https://github.com/bigscience-workshop/petals

their bloom model was also a collaborative effort https://huggingface.co/docs/transformers/en/model_doc/bloom

androiddrew 6 days ago

I was wondering what happened to this

WithinReason 6 days ago

The gradient info can be compressed 10000x with the right tricks, I think it is achievable. Nous claims they did it already:

https://github.com/NousResearch/DisTrO

There are other gradient compression papers from the past reporting large compression rates
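
For readers unfamiliar with gradient compression, here is a minimal sketch of one of the older ideas from that literature: top-k sparsification with error feedback. To be clear, this is not DisTrO's method (nothing in this thread describes how DisTrO works) and every name in it is made up for illustration. It only shows how sending a small fraction of gradient entries, while carrying the unsent remainder forward, cuts the volume per step.

```python
import random

def topk_compress(grad, k):
    """Keep only the k largest-magnitude entries as (index, value) pairs."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return [(i, grad[i]) for i in sorted(idx)]

def decompress(pairs, size):
    """Rebuild a dense gradient, filling dropped entries with zero."""
    dense = [0.0] * size
    for i, v in pairs:
        dense[i] = v
    return dense

def step_with_error_feedback(grad, residual, k):
    """Add the residual left over from earlier steps, send the top-k entries,
    and carry whatever was not sent into the next step."""
    corrected = [g + r for g, r in zip(grad, residual)]
    pairs = topk_compress(corrected, k)
    sent = decompress(pairs, len(grad))
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return pairs, new_residual

random.seed(0)
size, k = 10_000, 10          # toy sizes, chosen only for the example
grad = [random.gauss(0.0, 1.0) for _ in range(size)]
residual = [0.0] * size
pairs, residual = step_with_error_feedback(grad, residual, k)
ratio = size / len(pairs)
print(f"sent {len(pairs)} of {size} values ({ratio:.0f}x fewer)")
```

The ratio here counts values only; a real scheme also has to pay for the indices, and whether training still converges at very high ratios is exactly the open question being argued about.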

Davidzheng 6 days ago

Is the total compute capacity outside of Meta, Google, Amazon, Anthropic, OAI and X even higher than the capacity of any one of them? In any case, there's no chance a public collaboration gets to Anthropic levels of compute even if communication were no issue.

kelnos 6 days ago
Is the issue that training with less compute takes more time? Or is it just not possible? I think a collective using distributed training could tolerate the idea that it takes 10x as long as Anthropic to train a model, or whatever.
- mike_hearn 6 days ago
  
  It's possible but it's not linear. A modern AI training cluster is a supercomputer that uses very different architectures and hardware to a bunch of small PCs connected via normal networking. The networking advantage alone kills any chance of decentralized training.

laserx 6 days ago

there are some strong open source groups like Nous Research taking the fight https://nousresearch.com/

edg5000 6 days ago

It seems this project is serious and very promising. They have the Psyche network which seems real and operational. They're able to produce ~50B-class models, this will only grow over time of course. Very cool.

mycall 6 days ago

Maybe the training approaches taken to date are wrong for decentralized systems. Set up a virtual subnet you can trust and do training on that. Create an AI model island in a trusted/federated model system -- definitely slower than the typical 'one big model' approach, but scalable to world size modeling.

Also, it wouldn't be able to use a transformer architecture. For inspiration, take a look at Google Maps and how it uses a much more efficient A* divide/conquer hill-climbing architecture. Think minimized matrix math.

bradfa 6 days ago

Other comments also hint at this idea: a distributed training solution is currently an open research problem. Solving it is not easy, yet. But 10 years ago what we have today for LLMs would have looked similarly impossible, so have hope, and apply yourself to the problem if you find it interesting!

whiplash451 6 days ago

This could be of interest to you: https://thealliance.ai/projects/tapestry

procflora 6 days ago

Man, that project is such bait for my particular sensibilities but just looking at the copy about not sharing your data and only sharing weights has me feeling very disappointed in the project already. I would want a project like this to not elide the fact that sharing your weight updates probably effectively means sharing your data too.

andai 6 days ago

>The communication speeds are untenable.

Can it be parallelized or not?

If you take a model, make two copies, and fine-tune each one on different data, what happens when you merge them? Does it work if you freeze different layers?

I think this works if the steps are small enough. And the transfer should become tenable if the steps are big enough. Where's the cutoff?

mike_hearn 6 days ago
Yes it can be parallelized, it already is in real AI datacenters and no it doesn't help you. Like everyone else is saying, an AI datacenter is not just a bunch of gaming GPUs connected via normal ethernet and hasn't been for years.
At most a decentralized effort could contribute a little bit to some bigger centralized effort by doing inference and sandboxed CPU work. Modern model training isn't just backprop, it's got a huge and growing CPU and inferencing component too, which doesn't require intense inter-node communication. For instance, doing RL rollouts for agentic coding requires a lot of plain old inferencing and sandboxed containers for the models to practice in. The final results are just a set of rollouts and scores that can be uploaded back to a central datacenter for GRPO to adjust the weights (relatively cheap). But then, of course, you'd have to stick to models small enough to fit on people's computers so it'd never be competitive.
- andai 6 days ago
  
  Kinda sounds like we just need better computers.
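
The "rollouts and scores uploaded back for GRPO" step that mike_hearn mentions can be sketched roughly as follows. This only illustrates the group-relative scoring idea behind GRPO: each rollout's score is normalised against the other rollouts for the same prompt. The record layout, node names and scores are invented for the example, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def group_relative_advantages(scores, eps=1e-8):
    """Normalise each rollout's score against its own group of rollouts."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population std over the group
    return [(s - mu) / (sigma + eps) for s in scores]

# Hypothetical uploads: four rollouts for one prompt, scored by a sandboxed test run.
rollouts = [
    {"node": "volunteer-1", "score": 1.0},
    {"node": "volunteer-2", "score": 0.0},
    {"node": "volunteer-3", "score": 0.0},
    {"node": "volunteer-4", "score": 1.0},
]

advantages = group_relative_advantages([r["score"] for r in rollouts])
for r, a in zip(rollouts, advantages):
    print(r["node"], round(a, 3))
```

Only the scores and the rollout text need to travel back to the central trainer, which is why this part tolerates slow links in a way that gradient exchange does not.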
dangerlego5 6 days ago

[flagged]

Andrew_sooter 6 days ago

Have you checked out [petals](https://petals.dev/)? It’s doing the same thing, however the project is written in Python and there could be some optimizations to make it much faster.

Catloafdev 6 days ago

Ya that'd be an awesome project, the only issue is how do you verify it's not being poisoned? To actually validate it would require more analysis than the training took to run. It would require a trusted network, not an open one, unless that can get solved somehow.

sgsjchs 6 days ago
Make multiple nodes do the same job, compare results.
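
A minimal sketch of that "same job, compare results" idea, assuming the job is deterministic so honest nodes return bit-identical results. That is a strong assumption: floating-point results often differ across GPUs, so a real system would need a tolerance-based comparison instead of hashes. The function names and node labels are hypothetical.

```python
from collections import Counter
import hashlib

def digest(result):
    """Hash a result so nodes can be compared without shipping full tensors around."""
    return hashlib.sha256(repr(result).encode()).hexdigest()

def accept_by_majority(results_by_node):
    """Return (accepted_result, suspect_nodes).

    If no result has a strict majority, nothing is accepted and every
    node is reported as suspect.
    """
    digests = {node: digest(r) for node, r in results_by_node.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        return None, sorted(digests)
    suspects = sorted(n for n, d in digests.items() if d != winner)
    accepted = next(r for n, r in results_by_node.items() if digests[n] == winner)
    return accepted, suspects

# Three nodes run the same work unit; one returns something different.
results = {
    "node_a": [0.5, -1.25],
    "node_b": [0.5, -1.25],
    "node_c": [9.0, 9.0],
}
accepted, suspects = accept_by_majority(results)
print(accepted, suspects)
```

The obvious cost is that every work unit is computed several times, so a replication factor of three roughly triples the compute bill, and colluding nodes can still outvote an honest minority.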

logicchains 6 days ago

>I've been contemplating a decentralized model training system for some time using volunteer machines that we all contribute. But, it is astronomically difficult. The communication speeds are untenable.

It is already possible: https://arxiv.org/abs/2603.08163 . You don't need to sync so frequently, so it can be done over normal internet, it's just less efficient (takes longer to converge).
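
To make "you don't need to sync so frequently" concrete, here is a toy sketch of the general idea of averaging worker copies only every so often (commonly called local SGD). It is not the method of the linked paper, which is not described in this thread; the scalar model, learning rate, shard sizes and function name are all assumptions made for the example.

```python
import random

def local_sgd(shards, steps, sync_every, lr=0.1):
    """Fit a scalar w to minimise (w - x)^2 on each worker's shard,
    averaging the worker copies only every `sync_every` steps."""
    workers = [0.0 for _ in shards]
    syncs = 0
    for step in range(1, steps + 1):
        for i, shard in enumerate(shards):
            x = random.choice(shard)
            grad = 2.0 * (workers[i] - x)   # d/dw of (w - x)^2
            workers[i] -= lr * grad
        if step % sync_every == 0:
            avg = sum(workers) / len(workers)
            workers = [avg] * len(workers)  # the only communication round
            syncs += 1
    return sum(workers) / len(workers), syncs

random.seed(0)
# Four hypothetical volunteers, each holding samples centred on 3.0.
shards = [[random.gauss(3.0, 1.0) for _ in range(200)] for _ in range(4)]

w_often, syncs_often = local_sgd(shards, steps=400, sync_every=1)
w_rare, syncs_rare = local_sgd(shards, steps=400, sync_every=50)
print(f"sync every step:     {syncs_often} rounds, w = {w_often:.2f}")
print(f"sync every 50 steps: {syncs_rare} rounds, w = {w_rare:.2f}")
```

Both runs land near the same value while the second one communicates 50 times less often. A convex toy problem like this says nothing about how much convergence slows for a large model, which is the part the comment flags as the trade-off.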

rustcleaner 6 days ago

Could it be done by making a sparse MoE of thousands, or tens of thousands, of smaller experts in very niche domains? Maybe a tree-like structure of experts which can delegate from relatively general but inaccurate to extremely niche but accurate? Also these experts might be plug-and-play, easily swap out an inferior expert with a stronger one in the future without having to redo the whole pile?

Zetaphor 6 days ago
That's not really how the experts in an MoE work. They activate on token probabilities and are activated on every token. You don't necessarily have a discrete math expert and a discrete physics expert. And if it were you would still need a router that is trained on all of those domains.
- yorwba 6 days ago
  
  MoE models are typically designed for datacenter deployment, where per-token load-balancing is more important, but it's also possible to use a different training objective that encourages domain-specialization of experts: https://allenai.org/blog/emo But yes, this isn't really useful for distributed training as such because of the router.

incognito124 6 days ago

https://learning-at-home.github.io/

hajile 6 days ago

AI with blockchain. Maybe we can mix in IoT and VR for the ultimate in buzzword synergy.

slashdave 6 days ago

Well, I suppose it is understandable why you want to attack the most obvious problem with such a scheme: obtaining sufficient compute.

That does mean you are actually neglecting the more difficult issues.

AtlasBarfed 6 days ago

I just read or saw something about in document recalculation being a completely wasted step in every single training run. Is that true?

taylorhou 6 days ago

Let's collab. I'm one guy too, but I built a distributed inference network (teale.com), banging away for about a month with Opus/GPT.

monkeydust 6 days ago

Don't know, but could a BOINC setup, which has been around for ages, is mature, and has an incentive mechanism (Gridcoin), be used for this?

dominotw 6 days ago

We would be better off doing political activism to get the government to provide open researchers and builders access to GPUs in government-built datacenters.

0xpgm 6 days ago

There are some attempts at this problem, like Bittensor, Akash Network, etc.

whateverboat 6 days ago

The biggest problem is accuracy and integrity of the actors in the project.

labbett 6 days ago

Sounds like SETI@home but for AGI... SAGI@home?

DonHopkins 6 days ago

Since SAGI can't be practically distributed, and it puts so many people out of work, how about moving all of the unhoused people into the nice warm data centers, and call it home@SAGI.
Or is that too close to the plot of The Matrix?

thomasjeff1 6 days ago

I believe we are not the only ones

ai_fry_ur_brain 6 days ago

[flagged]

palisade 6 days ago

Someone with AI psychosis would say it was easy. I'm saying the opposite. I'm stating that it'd be cool, but at the moment I don't see how it is feasible. And, for fun I tried to solve one small aspect of the problem.
I also didn't bring up the concept out of nowhere; this is in response to an article about open source AI. The premise of the post is releasing control to the public. What is more open than a decentralized system? And why wouldn't you brainstorm in a comment on such a thread?
I also didn't ask an AI for the idea, it's just an idea I have. There's a difference.
bot403 6 days ago
The first half of your comment is unnecessarily aggressive and dismissive to OP.
- ai_fry_ur_brain 6 days ago
  
  Okay