Comment by aseipp

1 year ago

People bring this point up here every 2 weeks: the cost competitiveness of TPUs for Google comes exactly from the fact that they make them in-house and don't sell them. They don't need sales channels, support, leads, any of that stuff. They can design for exactly one software stack, one hardware stack, and one set of staff. You cannot just magically spin up a billion-dollar hardware company overnight with software, customers, sales channels and support, etc.

Nvidia has spent 20 years on this which is why they're good at it.

> If it was separate to Google then there are a bunch of companies who would happily spend some money on a real, working NVidia alternative.

Unfortunately, most people really don't care about Nvidia alternatives, actually -- they care about price, above all else. People will say they want Nvidia alternatives and support them, then go back to buying Nvidia the moment the price goes down. Which is fine, to be clear, but this is not the outcome people often allude to.

You can or at least historically could buy access to TPUs and request it for non-profit projects too through the TPU research programme. Certainly you have been able to pay for pro membership on Notebook to get TPU access, which is how many of the AI generation before ChatGPT learned to run AI. TPUs however were kind of always for training, never geared for inference.

  • That is correct, and I should have been more clear: when I say "Buy them" I mean direct sales of the hardware from seller to buyer. I am not referring to cloud compute style sales. Yes, they have been offering TPUs through Google Cloud for a long while now, but this still falls under all the stuff I said earlier: they don't need to have sales pipelines or channels (outside GCloud's existing ones), they don't need to design the hardware/software for arbitrary environments, they have one set of staff and machines to train and support, etc. All of that stuff costs money and ultimately it results in an entirely different sales and financial model.

    Google could spin the TPU division out of Google, but 99% of the time when people refer to moves like that, they omit the implied follow-up sentence, which is "I can then buy a TPU with my credit card off the shelf from a website that uses Stripe." It is just not that simple or easy.

> You cannot just magically spin up a billion-dollar hardware company overnight with software, customers, sales channels and support, etc.

Not saying it is easy or that it can be done magically.

Just noting that Groq (founded by the TPU creator) did exactly this.

  • Yes, and now after years of doing that Groq is pivoting to being a cloud compute company, renting their hardware through an API exactly the same way Google does.

    Building out your own vertically integrated offering with APIs is comparatively a lot simpler and significantly less risky in the grand scheme. For one thing, cloud APIs naturally benefit from the opex vs capex distinction that is often brought up here -- this is a big sales barrier, and thus a big risk. This is important because you can flush mid-8-figures down the toilet overnight for a single set of photomasks, so you are burning significant capital way before your foot is ever close to the proverbial door, much less inside it. You aren't going to make that money back selling single PCIe cards to enthusiastic nerds on Hacker News; you need big fish. Despite allusions to the contrary (people beating down your door to throw you bathtubs of money with no question), this isn't easy.

    Another good example of verticality is the software. The difference in scope and scale between "Tools that we run" and "Tools you can run" is actually huge. Think about things like model choice -- it can be much easier to support things like new models when you are taking care of the whole pipeline and a complete offering, versus needing to support compiler and runtime tools that can compile arbitrary models for arbitrary setups. You can call it cutting corners, but there are a huge number of tricky problems in this space and the time spent on procedural stuff ("I need to run your SDK on a 15 year old CentOS install!") is time not spent on the core product.

    There are other architectural reasons for them to go this route that make sense. But I really need to stress here that a big and important one is that hardware is, in fact, a very difficult business even with a great product.

    (Disclosure: I used to work at Groq back in 2022 before the Cloud Compute offering was available and LLMs were all the rage.)

    • I don't think renting out hardware is a bad model at all. Google spinning out their TPU work in this manner could be fine.

      I think some (large) buyers will want on-prem and they have large enough budgets to make that worthwhile.

      I don't think "sell individual TPUs to random people" is a great model. Most are better served by the cloud rental approach (although they might not think so themselves).

  • Isn't Groq pivoting to the IaaS/SaaS model because hardware channel sales are hard and it's easier for everyone to just use an API?