Comment by sho

2 days ago

As I replied to a child comment - this is a nice idea that just isn't tenable in reality. AI hardware isn't just hilariously faster than consumer GPUs, it's also hilariously more power-efficient and has hilariously better connectivity. Every one of these dimensions kills the idea.

The far, FAR superior power efficiency means that even if you did harness every public GPU or GPU-like device on earth, you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.

And even if electricity was free, having those GPUs spread over the world with internet-level latency will slow everything down by factors of thousands to millions - if it's feasible at all. Regardless, you're not getting fable-oss this decade, maybe even not this century.

It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.

AI hardware is for inference, not training. Training uses normal HPC crap. Superpods aren't really power efficient, it's kind of a meme, and it stems from limiting the power draw of other components by having fewer of them. It's more of a rounding error.

> you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.

With costs spread over a large population, it really doesn't matter. You're not getting hundreds of thousands of people to pitch half their monthly electric bill to pay for someone else's datacenter. They will pay for the electricity themselves quite happily though, if all they need to do is give you compute. This isn't new.

Interconnect is the bottleneck for distributed training, nothing else really.

  • You got it wrong. Inference can use crap GPUs. Training needs the 100x more expensive big guns. Our training machine is 100x more expensive than our inference machine.

    • How is the result of training stored? How big is that? It seems reasonable to assume we’ll eventually plateau and all we’ll need is relatively infrequent training.

  • > AI hardware is for inference, not training

    Not sure what you are referring to, unless you don't think H100/H200/B200 are "AI hardware"

    > Superpods aren't really power efficient

    Maybe not compared to a specialized rig with multiple 4090s, but that is the best case for consumer hardware - the vast majority will be dramatically less efficient than that

    Anyway, I agree the interconnect is by far the biggest obstacle and seems insurmountable, I should probably have led with that.

  • Bit of a doozy though, that one.

    I recall getting really excited over Hinton's Forward-Forward (FF) foray, right before he bailed on AI as a societal direction (which, if anyone ever had the right, I suppose he does). If one squints, one can see a backprop-free base being much easier to train on geographically distributed and heterogeneous hardware.

The efficiency difference between training on GPUs and TPUs is 2x at best. You can get very efficient with tensor cores, converging to TPU efficiency. In the end math is math, you can't make a multiplication more efficient than it already is on a GPU.

  • I guess this was more related to syncing GPUs.

    If you were to take 500 computers with older 1080 GPUs, you might have enough compute/RAM equivalent to an H200 GPU for training such a model. Maybe take 10000.

    But if those machines are spread over 10000 homes, wired with residential internet service, training a large model will not get anywhere.

    You go from "data in the same HBM memory chip" at 4.8 TB/s or "data in adjacent GPU" with NVLink at 1.2 TB/s down to 25 Mbit/s upload speed. Accessing the next piece of data is going to be about a million times slower. At the same time you will generate a thousand times more heat, for a million times longer.

    • You need to train independently and merge rarely. The problem is the merge step. Weights are too entangled, you are not going to get an improvement commensurate to the effort. Otherwise, everyone would do it. It is an open research problem.

  • The power-constrained part of compute is data movement, not the elementary arithmetic per se. Anyway, it's very possible to tweak the underlying design to increase throughput a lot for any given power budget at the cost of high latency. This seems especially useful for training workloads where we don't really care about latency as much.
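To put rough numbers on the bandwidth gap quoted in this subthread, here is a back-of-the-envelope sketch. The three link speeds are the figures given above (4.8 TB/s HBM, 1.2 TB/s NVLink, 25 Mbit/s residential upload). The 140 GB payload is a hypothetical example (70B parameters at 2 bytes each) and does not come from the thread. Only bandwidth is compared here; latency is ignored.

```python
# Rough arithmetic only. Link speeds are the figures quoted in the thread;
# PAYLOAD_BYTES is an assumed example size, not a number from the thread.
HBM_BYTES_PER_S = 4.8e12             # "same HBM" figure: 4.8 TB/s
NVLINK_BYTES_PER_S = 1.2e12          # "adjacent GPU" figure: 1.2 TB/s
HOME_UPLOAD_BYTES_PER_S = 25e6 / 8   # 25 Mbit/s upload, converted to bytes/s

PAYLOAD_BYTES = 140e9                # assumed: 70B parameters * 2 bytes each


def slowdown(fast_bytes_per_s: float, slow_bytes_per_s: float) -> float:
    """How many times slower the slow link is than the fast one."""
    return fast_bytes_per_s / slow_bytes_per_s


def transfer_seconds(payload_bytes: float, bytes_per_s: float) -> float:
    """Time to push a payload through a link, ignoring latency and overhead."""
    return payload_bytes / bytes_per_s


if __name__ == "__main__":
    print("HBM vs home upload:   ", slowdown(HBM_BYTES_PER_S, HOME_UPLOAD_BYTES_PER_S))
    print("NVLink vs home upload:", slowdown(NVLINK_BYTES_PER_S, HOME_UPLOAD_BYTES_PER_S))
    hours = transfer_seconds(PAYLOAD_BYTES, HOME_UPLOAD_BYTES_PER_S) / 3600
    print("Hours to upload the assumed payload once:", round(hours, 1))
```

With these inputs the HBM-to-upload ratio comes out to about 1.5 million and the NVLink-to-upload ratio to about 384,000, which is in line with the "about a million times slower" estimate.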

Could you put some numbers and examples behind the efficiency gap between data center and consumer-grade AI hardware? Did you include examples like the RTX Spark on the consumer side? I was always amazed at the low power consumption of unified memory style architectures. In absolute terms and even more so compared to consumer-grade GPUs. I'd be genuinely interested in a comparison with data-center-grade hardware.

  • DGX Spark is effectively prosumer hardware, better than most consumer stuff but still not comparable to actual datacenter gear. You can't just look at TDP in isolation without also comparing performance.

  • It's more than the raw hardware, it's the interconnect and communication between the hardware at scale. These models are trained on hundreds of thousands of GPUs today. You _will_ start to see cross-datacenter training runs but this needs to efficiently decide when and how to communicate across datacenters, which carries a very high cost compared to intra-datacenter communication.
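One way to picture the "when to communicate" trade-off is to let each worker take several local gradient steps and average parameters only occasionally (often called local SGD). Below is a minimal sketch on a toy one-parameter least-squares problem. The problem, the function names, and every constant are assumptions chosen for illustration. The toy problem is convex, so infrequent averaging works well here; that says nothing about whether it works for large non-convex networks.

```python
import random


def make_shard(rng, true_w, n):
    """Noisy samples of y = true_w * x held by one worker (toy data)."""
    shard = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        shard.append((x, true_w * x + rng.gauss(0.0, 0.01)))
    return shard


def local_steps(w, shard, steps, lr):
    """Run `steps` full-batch gradient steps on this worker's squared error."""
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)
        w -= lr * grad
    return w


def train(num_workers=4, total_steps=200, sync_every=1, lr=0.5, seed=0):
    """Return (final weight, number of synchronization rounds)."""
    rng = random.Random(seed)
    shards = [make_shard(rng, 3.0, 50) for _ in range(num_workers)]
    w = 0.0
    sync_rounds = 0
    for _ in range(total_steps // sync_every):
        # Each worker starts from the shared weight and trains on its own.
        local = [local_steps(w, shard, sync_every, lr) for shard in shards]
        # Averaging is the only step that needs communication.
        w = sum(local) / num_workers
        sync_rounds += 1
    return w, sync_rounds


if __name__ == "__main__":
    for sync_every in (1, 10, 50):
        w, rounds = train(sync_every=sync_every)
        print(f"sync_every={sync_every:2d}  rounds={rounds:3d}  w={w:.4f}")
```

In this toy setting, syncing every 50 steps cuts communication rounds from 200 to 4 while still landing near the true weight of 3.0.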

Dunno, in a sense, torrents came about under similar restrictions. Everything at consumer level was just plain awful and at dial-up level, mebbe ISDN if you were very lucky, with fiber only available to ridiculously rich people and corps. But with restrictions came approaches to mitigate them.

  • Yes but not violations of the laws of physics. You need extremely fast communications, memory bandwidth, etc; you cannot get that with distributed training. You're up against the speed of light and the interconnect that powers the internet. You will always have horrifically slow latency compared to if you pack the servers together in the same place with specialized networking.

    • << You will always have horrifically slow latency compared to if you pack the servers together in the same place with specialized networking.

      Agree about the physics; disagree about the larger point.

      I am not questioning that servers packed together may achieve an optimal result in how we are currently doing things, but, and this is my point, what if we didn't.

      << you cannot get that with distributed training

      This is entirely the wrong question to ask. The question to ask is: how could it be adapted to distributed training?

  • If weights can't be looked at almost instantly in bulk, it just doesn't work. It's a different problem from distributing file downloads.

    • I used it as an example. I understand the problem is hard. My larger point was that this is exactly how actual progress tends to take place. Well, that and porn.

> It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.

100% agree. The US government basically has to nationalize AI and capture an outsize portion of the revenue from it in order to fix the economy, as the combination of debt burden and interest rate pressure from de-dollarization/global realignment is going to push us into a death spiral, and even if AI is a smash hit, the ~19% federal capture of corporate revenue isn't nearly enough to pull us out of it. The people owning the compute infrastructure and capturing more profit from AI at that layer is the safest, cleanest way to increase revenue capture. A sovereign wealth fund is a mediocre idea because it's possible to play a shell game with stocks and redirect profit/debt (venture capital is quite good at this!).

  • >> The US government basically has to nationalize AI and capture an outsize portion of the revenue from it

    Currently AI has generated no profit. And as it sits, it is a non-viable business.

    I refuse to include the sellers of shovels as AI revenue.

    If the companies buying the shovels are still losing money, then the tool suppliers' fortunes have nothing to do with the economics of the AI application layer, which is losing money on every prompt.

    • It's the most naive opinion that keeps getting shoveled around. You have a product that is viewed as essential by businesses, with revenue growing by 10x a year and geopolitical ramifications that have continued to rear their heads, and your opinion is "this is all an unprofitable shill". It is extraordinary to me that people really believe this. Whether or not labs run at a loss today is absolutely irrelevant. There are of course steady-state economics that make sense, and it's not well known what the profitability picture is right now, so to say "Currently AI has generated no profit" is also just speculation and not a very insightful one at that.

    • I've heard that the API calls by themselves are ~60% profit if you ignore capital expenditures. The labs haven't generated profit because they're constantly sinking money into the next generation of larger models to stay relevant. Dario has talked about the economics of this a lot, and I do believe him there.

      There's clearly also a lot of pent-up demand in the corporate world for inference. The problem is that it's currently expensive enough that enterprises are balking at the cost before they've had a chance to refine processes and see projects through to fruition. That's a tractable problem to solve though.

  • > The US government basically has to nationalize AI and capture an outsize portion of the revenue from it in order to fix the economy, as the combination of debt burden and interest rate pressure from de-dollarization/global realignment is going to push us into a death spiral, and even if AI is a smash hit, the ~19% federal capture of corporate revenue isn't nearly enough to pull us out of it.

    Any actual numbers to back this up? I don't see how nationalizing a very cutting-edge technology outside of wartime is going to go super well. The leverage that these companies have is the same leverage that TSMC has: you can't just take over and expect things to rocket at the pace it's going.

WRT government data centers, there is certainly precedent for independent researchers getting HPC time on systems owned by US national labs, research institutions, universities, and then publishing their results as part of the public good.

One would question why this hasn't already happened as the rule, as opposed to the proliferation of private data centers. However, I am sure the answers are plain and perhaps saddening to us all.

What makes you think Deepseek or GLM won't catch up to Fable level? Why would there be a break in the trend now?

  • DeepSeek and GLM (plus Kimi) are at or above Sonnet level wrt. favorable workloads like coding. They're not close to Opus or the latest GPT yet, and Fable is even higher than that. Other workloads relying more on real-world knowledge have them even further behind, and this can't be mitigated without making the model itself bigger and harder to host locally.

    • Not true. Big models buy you baked in knowledge and long context cohesion. A model can be trained to use search and knowledge base tools more efficiently to mitigate the former, and harnesses/workflows can be designed to push models into small parallel threads to mitigate the latter.

      The thing that big models will always bring to the table is the ability to YOLO weak/under-specified prompts, and spend less time in the loop making sure work gets partitioned correctly. For smaller/simpler tasks the P(success) difference isn't that big.

    • > They're not close to Opus or the latest GPT yet

      Disagreed. GLM-5.1 is easily as good as Opus 4.5 for all the coding purposes I could throw at it, which is the model that kicked this entire hype cycle into overdrive in the first place.

  • The key thing here is that effective intelligence = model capability / cost. If you drive down the cost of inference you can have higher effective capability even with a technically less capable model. There is nothing in Anthropic's/OpenAI's general reasoning capabilities that can't be easily done much better with a purpose-built harness for a domain-specific task.

  • I think there are at least a few question marks.

    One being that extrapolating from like 3 data points is hardly science. All trends break at some point.

    The other is that the measures to prevent distillation of their models (if it was a secret sauce of Chinese models) could work if nobody is allowed to use them.

> It would be better for governments to buy and own their own datacenters,

I mean that's good, but they'd also have to build their own dataset. Which involves either paying people, or breaking the law.

Plus if they do manage to make it work, they will not get any tax revenue from it, as it'll remove the need for labour, which is where a huge amount of tax revenues come from.

It's a deeply hard problem with lots of second/third order effects.

> As I replied to a child comment - this is a nice idea that just isn't tenable in reality. AI hardware isn't just hilariously faster than consumer GPUs, it's also hilariously more power-efficient and has hilariously better connectivity. Every one of these dimensions kills the idea.

The first part is not really true though, the chips are not that much faster, the DRAM is not that much faster, and in aggregate it does not matter because there is just so much more consumer hardware out there (although perhaps that is changing as supply shifts toward datacenters).

The interconnect and data locality are the problem. If you could train it like e.g. you can render a scene with Monte Carlo ray tracing, any result from any node could be merged with any other and the combined result would have converged closer to the limit. I am sure research in that direction exists, it just has not proven effective within the scales it has been attempted.
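The Monte Carlo property described above can be illustrated with a much simpler estimator than a ray tracer. In this sketch (a stand-in chosen for brevity, with hypothetical function names), each node independently estimates pi by random sampling and reports only a pair of counts. Any two partial results merge by addition, in any order, with no coordination during the work itself. Neural network weights have no comparably simple merge rule, which is the gap the comment points at.

```python
import math
import random


def node_estimate(seed, samples):
    """One node's independent work: count random points inside the unit quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, samples


def merge(a, b):
    """Merging two partial results is just adding their counts."""
    return a[0] + b[0], a[1] + b[1]


def to_pi(result):
    """Turn (hits, samples) into an estimate of pi."""
    hits, samples = result
    return 4.0 * hits / samples


if __name__ == "__main__":
    partials = [node_estimate(seed, 20_000) for seed in range(8)]
    merged = (0, 0)
    for partial in partials:
        merged = merge(merged, partial)
    for i, partial in enumerate(partials):
        print(f"node {i}: {to_pi(partial):.4f}")
    print(f"merged: {to_pi(merged):.4f}  (math.pi = {math.pi:.4f})")
```

The merged estimate is exactly what a single machine would have produced from the pooled samples, so adding nodes only adds samples. The statistical error shrinks with the square root of the total sample count.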