> Nvidia didn't ship a 256gb system at sub-500gb/s transfer rate
DGX Spark has 128 GB and only 273 GB/s of memory bandwidth. Are we lucky that NVIDIA did ship something even worse than what you specified? I'm confused.
People have been complaining [1] about how little VRAM NVIDIA ships with their GPUs for decades. Their whole game has been "oh, you want more VRAM? Buy more or pay us 50x for server grade with 10x as much VRAM. The more you buy, the more you save."
Apple did everyone a solid by shipping something way out of that distribution. We now know more than we did before! We know that a 284B parameter model with 13B active params (or 35B with 3B active, or 671B with 37B active) can outperform a 2T model and draw a fraction as much power. How can you think that's a bad thing?
You could point out that Apple didn't invent the idea of MoE. Everyone knows that. But other than Macs, there simply were no machines with >100 GB of VRAM directly coupled to ~50 TFLOP/s of compute until the DGX Spark last Dec. If you wanted to run a model with more than 32 GB of weights, you had to either pay up for dozens of GPUs idling at hundreds of watts, or really pay up for $50,000 server GPUs idling at... also 100-200 W each.
I feel lucky to have a $3k machine on my shelf that can run DS4-Flash with 1M context at 20t/s while drawing ~150W and making very little noise. The best part? It idles at 30W with DS4 loaded, dropping to 6W after a reboot. There isn't a single GPU on the market that can match that in the same shoebox volume.
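The back-of-envelope math behind that number: decode is memory-bandwidth-bound, since every active weight has to stream from memory once per generated token. A quick sketch (my assumptions, not from the comment: ~273 GB/s of bandwidth, 13B active params at 8-bit, ignoring KV-cache reads) shows why MoE makes these machines viable:

```python
# Rough decode-throughput estimate for a bandwidth-bound model.
# Assumed numbers (hypothetical, for illustration): 273 GB/s memory
# bandwidth, 8-bit weights (1 byte/param). Real systems lose some
# bandwidth to KV-cache reads and overhead, so treat this as an upper bound.

def decode_tokens_per_sec(bandwidth_gbps: float, active_params_b: float,
                          bytes_per_param: float = 1.0) -> float:
    """Each decoded token streams every active weight from memory once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 / bytes_per_token

# Dense 284B model vs. MoE with 13B active, same bandwidth and precision:
dense = decode_tokens_per_sec(273, 284)
moe = decode_tokens_per_sec(273, 13)
print(f"dense: {dense:.1f} t/s, MoE: {moe:.1f} t/s")
```

At that bandwidth a dense 284B model would crawl at roughly 1 t/s, while 13B active params lands around 21 t/s, which is in the ballpark of the 20 t/s figure above.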
[1] https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRlOW0N...