Comment by tavavex

15 hours ago

I think that it's not just about the ratio. To me the difference is that Starlink satellites are fixed-scope, miniature satellites that perform a limited range of tasks. When you talk about GPUs, though, your goal is maximizing the amount of compute you send up. Which means you need to push as many of these GPUs up there as possible, to the point where you'd need huge megastructures with solar panels and radiators that would probably start pushing the limits of what in-space construction can do. Sure, the ratio would be the same, but what about the scale?

And you also need it to make sense not just from a maintenance standpoint, but from a financial one. In what world would launching what's equivalent to huge facilities that work perfectly fine on the ground make sense? What's the point? If we had a space elevator and nearly free space deployment, then yeah maybe, but how does this plan square with our current reality?

Oh, and don't forget about getting some good shielding for all those precise, cutting-edge processors.

Assuming you can stay out of the way of other satellites, I'd guess you'd think about density differently than when building on Earth. From a brief look at the ISS thermal system, it would seem the biggest challenge would be getting enough coolant and pumping equipment into orbit for a significant wattage of compute.

Why would you need to fit the GPUs all in one structure?

You can have a swarm of small, disposable satellites with laser links between them.

  • Because the latencies required for modern AI training are extremely restrictive. A light-nanosecond is famously a foot, and the critical distances have to be kept in that range.

    And a single cluster today would already require more solar & cooling capacity than all Starlink satellites combined.

  • Because that brings in the whole distributed computing mess. No matter how instantaneous the actual link is, you still have to deal with the problems of which satellites can see one another, how many simultaneous links can exist per satellite, the max throughput, the need for better error correction and all sorts of other things that will drastically slow the system down even in the best case. Unlike something like Starlink, with GPUs you have to be ready for everyone needing to talk to everyone else at the same time while maintaining insane throughput. If you want to send GPUs up one by one, get ready to also equip each satellite with a fixed mass of everything required to transmit and receive that much data, redundant structural/power/compute mass, individual shielding and much more. All the wasted mass you have to launch with individual satellites makes the already nonsensical pricing even worse. It just makes no sense when you can build a warehouse on the ground, fill it with shoulder-to-shoulder servers that communicate in a simple, sane and well-known way and can be repaired on the spot. What's the point?

    • Isn't this already a major problem for AI clusters?

      I vaguely recall an article a while ago about the impact of GPU reliability: a big problem with training is that the entire cluster basically operates in lock-step, with each node needing the data its neighbors calculated during the previous step to proceed. The unfortunate side-effect is that any failure stops the entire hundred-thousand-node cluster from proceeding - as the cluster grows even the tiniest failure rate is going to absolutely ruin your uptime. I think they managed to somehow solve this, but I have absolutely no idea how they managed to do it.

    • Starlink already solved those problems: its satellites do 200 Gbit/s via laser links between them.

      And for data centers, the satellites wouldn't be as far apart as Starlink satellites; they would be quite close instead.
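
Two of the numbers argued over above can be sanity-checked with back-of-the-envelope arithmetic: how long light takes to cross a given gap between nodes, and how quickly "every node is healthy" becomes unlikely as a lock-step cluster grows. This is only a rough sketch. The function names, the example distances, and the per-node failure probability of 1e-5 per step are all made up for illustration, and node failures are assumed to be independent, which real hardware may not satisfy.

```python
# Rough arithmetic for two claims in the thread. All inputs are illustrative.

C = 299_792_458.0  # speed of light in vacuum, m/s


def one_way_delay_ns(distance_m: float) -> float:
    """Minimum one-way signal delay over a straight line, in nanoseconds."""
    return distance_m / C * 1e9


def all_nodes_up_probability(per_node_failure_prob: float, nodes: int) -> float:
    """Chance that no node fails during one step, assuming independent failures."""
    return (1.0 - per_node_failure_prob) ** nodes


if __name__ == "__main__":
    # One foot, then some hypothetical spacings between satellites.
    for distance_m in (0.3048, 100.0, 1_000.0, 100_000.0):
        delay = one_way_delay_ns(distance_m)
        print(f"{distance_m:>10.1f} m -> {delay:>12.1f} ns one way")

    # Hypothetical failure probability per node per step (an assumption).
    p_fail = 1e-5
    for nodes in (1_000, 10_000, 100_000):
        p_ok = all_nodes_up_probability(p_fail, nodes)
        print(f"{nodes:>7} nodes -> P(no failure this step) = {p_ok:.3f}")
```

The delay figure is a lower bound set by the speed of light alone; it ignores serialization, switching and error-correction overhead, so real links would be slower.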
