Comment by mnky9800n

19 hours ago

I really don't understand the argument that Nvidia GPUs only work for 1-3 years. I am currently using A100s and H100s every day. Those aren't exactly new anymore.

It’s not that they don’t work. It’s how businesses handle hardware.

I worked at a few data centers on and off in my career. I got lots of hardware for free or on the cheap simply because the hardware was considered "EOL" after about 3 years, often when support contracts with the vendor end.

There are a few things to consider.

Hardware that ages produces more errors, and those errors cost you, one way or another.

Rack space is limited. A perfectly fine machine that consumes 2x the power for half the output still costs you that space. It's often cheaper to replace a perfectly fine working system simply because the newer one performs better per watt in the same footprint (rough numbers in the sketch below).

Lastly, there are tax implications to buying new hardware that often favor replacement.
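
A minimal sketch of that perf-per-watt math; all numbers are illustrative assumptions, not vendor specs:

```python
# Perf-per-watt sketch: a fixed rack slot and power budget, old vs. new.
# All numbers are illustrative assumptions, not real GPU figures.
USD_PER_KWH = 0.10
HOURS_PER_YEAR = 24 * 365

machines = {
    "old": {"watts": 800, "throughput": 1.0},  # 2x the power, half the output
    "new": {"watts": 400, "throughput": 2.0},
}

for name, m in machines.items():
    energy_cost = m["watts"] / 1000 * HOURS_PER_YEAR * USD_PER_KWH
    print(f"{name}: ${energy_cost:,.0f}/yr energy, "
          f"${energy_cost / m['throughput']:,.0f}/yr per unit of work")

# The old machine pays 4x as much energy for each unit of work out of the
# same rack slot, which is why replacing "perfectly fine" hardware can pay off.
```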

  • I'll be so happy to buy an EOL H100!

    But no, there are none to be found. It's a four-year-old, two-generations-old machine at this point, and you can't buy one used at a rate cheaper than new.

    • Well, demand is currently so high that this cycle likely doesn't exist yet for fast cards.

      For servers, I've seen slightly used equipment sold in bulk to a bidder, who may have a single large client buy all of it.

      Then, around the time the second cycle comes around, it's split up into lots and a bunch ends up at places like eBay.


    • There are plenty on eBay? At the end of your comment you say "a rate cheaper than new", so maybe you mean you'd love to buy a discounted one. But they do seem to be available used.


  • > Rack space is limited.

    Rack space and power (and cooling) in the datacenter drive what hardware stays in the datacenter.

  • Do you know how support contract lengths are determined? Seems like a path to force hardware refreshes with boilerplate failure data carried over from who knows when.

The common factoid raised in financial reports is that GPUs used in model training degrade thermally due to their sustained high utilization; the GPUs ostensibly fail. I have heard anecdotal reports of GPUs used for cryptocurrency mining having similar wear patterns.

I have not seen hard data, so this could be an oft-repeated but false fact.

  • It's the opposite, actually: most GPUs used for mining are run at a consistent temperature and load, which is good for long-term wear. Peaky loads, where the GPU goes from cold to hot and back, lead to more degradation because of thermal-expansion cycling. This has been known for some time now.

    • That is a commonly repeated idea, but it doesn't account for the countless token farms smaller than a datacenter: anything from a single motherboard with 8 cards to a small shed full of rigs, all of which tend to disregard common engineering practices and run hardware into the ground to maximize output until the next police raid or difficulty bump. There are plenty of photos on the internet of crappy rigs like that, and no one can guarantee where a given used GPU came from.

      Another commonly forgotten issue is that many electrical components are rated in hours of operation, and cheaper boards tend to use components with smaller tolerances. That rated lifetime is actually a curve, where the rated hours decrease as temperature rises. There have been instances of whole batches of cards failing due to failing MOSFETs, for example.


  • > I have heard anecdotal reports of GPUs used for cryptocurrency mining having similar wear patterns.

    If this were anywhere close to a common failure mode, I'm pretty sure we'd know that already, given how crypto mining GPUs were usually run to the max in makeshift settings with woefully inadequate cooling and environmental control. The overwhelming anecdotal evidence from people who have bought them is that even a "worn" crypto GPU is absolutely fine.

  • I can't confirm that fact, but it's important to acknowledge that consumer usage is very different from the high continuous utilization in mining and training. It is plausible that the wear on cards under such extreme usage is as high as reported: a consumer might load their card at peak maybe 5% of waking hours, while a datacenter card runs near 100% around the clock, so a reported lifetime drop of only about 3x is a believable scale of endurance loss (rough arithmetic below).
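
    A rough sketch of that duty-cycle arithmetic; the 5% and ~3x figures come from this comment, and the 16 waking hours per day is an assumption:

    ```python
    # Duty-cycle sketch: hours under load, consumer vs. datacenter card.
    # The 5% and ~3x figures are from the comment above; 16 waking hours/day
    # is an assumption.
    WAKING_HOURS_PER_DAY = 16

    consumer_hours_per_year = 0.05 * WAKING_HOURS_PER_DAY * 365   # ~5% duty cycle
    datacenter_hours_per_year = 24 * 365                          # flat-out, 24/7

    ratio = datacenter_hours_per_year / consumer_hours_per_year
    print(f"consumer:   {consumer_hours_per_year:>6,.0f} h/yr under load")
    print(f"datacenter: {datacenter_hours_per_year:>6,.0f} h/yr under load ({ratio:.0f}x)")

    # ~30x the hours under load for only ~3x the reported lifetime loss:
    # wear clearly isn't linear in hours, so a 3x endurance hit is a modest,
    # believable figure rather than an extreme one.
    ```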

1-3 years is too short, but they aren't making new A100s. There are 8 in a server, and when one goes bad, what do you do? You won't be able to renew a support contract. If you want to DIY, you eventually have to start consolidating pick-and-pulls. Maybe the vendors will buy them back from people who want to upgrade and resell them. This is the issue we are seeing with A100s, and we are trying to see what our vendor will offer for support.

They're no longer energy competitive; i.e., the power consumed per unit of compute exceeds that of what is available now.

It's like if your taxi company bought taxis that were more fuel efficient every year.

  • Margins are typically not so razor thin that you cannot operate with technology from one generation ago. 15 vs 17 mpg is going to add up over time, but for a taxi company it's probably not a lethal situation to be in (rough numbers below).
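
    Rough numbers for that gap; annual mileage and fuel price are assumptions:

    ```python
    # Fuel-cost gap between 15 and 17 mpg taxis, under assumed usage.
    MILES_PER_YEAR = 50_000       # assumption: a busy taxi
    USD_PER_GALLON = 3.50         # assumption: fuel price

    for mpg in (15, 17):
        cost = MILES_PER_YEAR / mpg * USD_PER_GALLON
        print(f"{mpg} mpg: ${cost:,.0f}/yr in fuel")

    # ~$11,700 vs ~$10,300 per year: real money that adds up over time,
    # but not lethal next to the cost of replacing the fleet.
    ```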

  • If a taxi company did that every year, they'd be losing a lot of money. Of course new cars and cards are cheaper to operate than old ones, but is that difference enough to offset buying a new one every one to three years?

    • >If a taxi company did that every year, they'd be losing a lot of money. Of course new cars and cards are cheaper to operate than old ones, but is that difference enough to offset buying a new one every one to three years?

      That's where the analogy breaks. There are massive efficiency gains from new process nodes, which new GPUs use. Efficiency improvements for cars are glacial, aside from "breakthroughs" like hybrid/EV cars.

    • >offset buying a new one every one to three years?

      Isn't that precisely how leasing works? Also, don't companies prefer not to own hardware for tax purposes? I've worked for several places where they leased compute equipment with upgrades coming at the end of each lease.


    • If there were a new taxi every other year that could handle twice as many fares, they might. That's not how taxis work, but that is how chips work.

  • Nvidia has plenty of time and money to adjust. They're already buying out upstart competitors to their throne.

    It's not like the CUDA advantage is going anywhere overnight, either.

    Also, if Nvidia invests in its users and in the infrastructure layouts, it gets to see upside no matter what happens.

Not saying you're wrong. A few things to consider:

(1) We simply don't know what the useful life is going to be, because AI-focused GPUs used for training and inference are such a new development.

(2) Warranties and service. Most enterprise hardware has service contracts tied to purchases. I haven't seen anything publicly disclosed about what these contracts look like, but the speculation is that they are much more aggressive (3 years or less) than typical enterprise hardware contracts (Dell, HP, etc.). Past those contracts, extended support can typically get really pricey.

(3) Power efficiency. If new GPUs are more power efficient, the energy savings could be huge, which could justify upgrades.

  • Nvidia is moving to a 1-year release cycle for data center parts, and in Jensen's words, once a new generation is released you lose money by staying on the older hardware. It no longer makes financial sense to run it.

  • Based on my napkin math, an H200 would need to run for 4 years straight at maximum power (10.2 kW) to consume its own $35k purchase price in energy (at 10 cents per kWh); checked below.
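
    That arithmetic checks out as stated; a quick verification, using only the figures from this comment (though note 10.2 kW looks like a full 8-GPU system's draw, while a single H200 is closer to 0.7 kW, which would stretch the breakeven much further):

    ```python
    # Verify the napkin math: how long until energy cost equals purchase price?
    POWER_KW = 10.2      # stated max power draw (likely a full 8-GPU system)
    PRICE_USD = 35_000   # stated purchase price
    USD_PER_KWH = 0.10   # stated electricity rate

    hours = PRICE_USD / (POWER_KW * USD_PER_KWH)   # hours to burn $35k of power
    print(f"{hours / (24 * 365):.1f} years")       # -> 3.9 years
    ```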

If power is the bottleneck, it may make business sense to rotate to a GPU that makes better use of the same power, provided the newer generation gives you a significant advantage.

From an accounting standpoint, it probably makes sense for their depreciation schedule to be 3 years (sketched below). But yeah, my understanding is that either they have long service lives, or the customers sell them back to the distributor so they can buy the latest and greatest. (The distributor would sell them as refurbished.)
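
A minimal sketch of a 3-year straight-line schedule, assuming a hypothetical $30k card and zero salvage value:

```python
# Straight-line depreciation over 3 years for a hypothetical $30k card.
COST, SALVAGE, YEARS = 30_000, 0, 3

annual = (COST - SALVAGE) / YEARS
for year in range(1, YEARS + 1):
    book = COST - annual * year
    print(f"year {year}: expense ${annual:,.0f}, book value ${book:,.0f}")
```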

I think the story is less about the GPUs themselves, and more about the interconnects for building massive GPU clusters. Nvidia just announced a massive switch for linking GPUs inside a rack. So the next couple of generations of GPU clusters will be capable of things that were previously impossible or impractical.

This doesn't mean much for inference, but for training, it is going to be huge.