Comment by nl

1 year ago

It's crazy that Google doesn't spin out their TPU work as a separate company.

TPUs are the second most widely used training environment after Nvidia GPUs. It's the only environment outside CUDA that people build optimized kernels for.

If it were separate from Google, there are a bunch of companies that would happily spend money on a real, working Nvidia alternative.

It might be profitable from day one, and it surely would gain substantial market capitalization - Alphabet shareholders should be agitating for this!

People bring this point up here every couple of weeks. The cost competitiveness of TPUs for Google comes precisely from the fact that they make them in house and don't sell them: they don't need sales channels, support, leads, any of that stuff. They can design for exactly one software stack, one hardware stack, and one set of staff. You cannot just magically spin up a billion-dollar hardware company overnight, with software, customers, sales channels, support, and all the rest.

Nvidia has spent 20 years on this which is why they're good at it.

> If it were separate from Google, there are a bunch of companies that would happily spend money on a real, working Nvidia alternative.

Unfortunately, most people don't actually care about Nvidia alternatives -- they care about price, above all else. People will say they want Nvidia alternatives and will support them, then go back to buying Nvidia the moment the price goes down. Which is fine, to be clear, but it is not the outcome people often allude to.

  • You can, or at least historically could, buy access to TPUs, and request it for non-profit projects through the TPU Research Cloud programme. Certainly you have been able to pay for a Colab Pro subscription to get TPU access, which is how much of the AI generation before ChatGPT learned to run models. TPUs, however, were always geared toward training, never inference.

    • That is correct, and I should have been clearer: when I say "buy them" I mean direct sales of the hardware from seller to buyer, not cloud-compute-style sales. Yes, they have been offering TPUs through Google Cloud for a long while now, but this still falls under everything I said earlier: they don't need sales pipelines or channels (outside Google Cloud's existing ones), they don't need to design the hardware or software for arbitrary environments, they have one set of staff and machines to train and support, and so on. All of that costs money, and it ultimately results in an entirely different sales and financial model.

      Google could spin the TPU division out of Google, but 99% of the time people propose moves like that, they omit the implied follow-up sentence: "I can then buy a TPU with my credit card, off the shelf, from a website that uses Stripe." It is just not that simple or easy.


  • > You cannot just magically spin up a billion-dollar hardware company overnight with software, customers, sales channels and support, etc.

    Not saying it is easy, or that it can be done magically.

    Just noting that Groq (founded by the TPU creator) did exactly this.

    • Yes, and now after years of doing that Groq is pivoting to being a cloud compute company, renting their hardware through an API exactly the same way Google does.

      Building out your own vertically integrated offering with APIs is comparatively much simpler and significantly less risky in the grand scheme. For one thing, cloud APIs naturally benefit from the opex-vs-capex distinction that is often brought up here -- upfront capex is a big sales barrier, and thus a big risk. This matters because you can flush mid-eight figures down the toilet overnight on a single set of photomasks, so you are burning significant capital well before your foot is anywhere near the proverbial door, much less inside it. You aren't going to make that money back selling single PCIe cards to enthusiastic nerds on Hacker News; you need big fish. Despite allusions to the contrary (people beating down your door to throw bathtubs of money at you, no questions asked), this isn't easy.

      Another good example of verticality is the software. The difference in scope and scale between "tools that we run" and "tools you can run" is huge. Think about things like model choice: it is much easier to support new models when you control the whole pipeline and a complete offering, versus needing to ship compiler and runtime tools that can compile arbitrary models for arbitrary setups. You can call it cutting corners, but there is a huge number of tricky problems in this space, and time spent on procedural stuff ("I need to run your SDK on a 15 year old CentOS install!") is time not spent on the core product.

      There are other architectural reasons for them to go this route that make sense. But I really need to stress here that a big and important one is that hardware is, in fact, a very difficult business even with a great product.

      (Disclosure: I used to work at Groq back in 2022 before the Cloud Compute offering was available and LLMs were all the rage.)
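      As a rough illustration of the photomask point above (the dollar figures below are assumptions chosen for the sketch, not Groq's actual numbers):

```python
# Back-of-the-envelope break-even for a single set of photomasks.
# Both numbers are illustrative assumptions, not real figures.
mask_set_cost = 30_000_000   # "mid-eight figures" for one mask set
margin_per_card = 2_000      # assumed gross margin on one PCIe card

# Cards you must sell just to recover the mask set, before wafers,
# packaging, boards, payroll, or anything else.
break_even_cards = mask_set_cost // margin_per_card
print(break_even_cards)  # 15000
```

      Even at an optimistic $2,000 margin per card, that is fifteen thousand cards sold before the masks alone are paid off -- which is why you need big fish, not individual buyers.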


    • Isn't Groq pivoting to the IaaS/SaaS model because hardware channel sales are hard and it's easier for everyone to just use an API?

The TPUs are highly integrated with the rest of the internal Google ecosystem, both hardware and software. Untangling that would be ... interesting.

  • We have a perfectly reasonable blueprint for an ML accelerator that isn't tied into the Google ecosystem: Nvidia's entire product line.

    Between that and the fact Google already sells "Coral Edge TPUs" [1] I'd think they could manage to untangle things.

    Whether the employees would want to be spun off or not is a different matter, of course...

    [1] https://coral.ai/products/

    • Do you think Nvidia is happy not having an online ecosystem that ties customers to its GPUs for added (sales) value? They are more than happy to entangle their GPUs with their proprietary CUDA language.

      For a large, established, quasi-monopoly company it's always more attractive to keep things inside their walled gardens. Suggesting that Google should start supporting TPUs outside Google Cloud is like suggesting that Apple should start supporting iOS on non-Apple hardware.


  • Knowing what I know about big corporations, the biggest entanglement is going to be IP ownership, political constraints and promises to shareholders.

There would probably be huge demand, but would Google be able to satisfy it? Is it currently even able to satisfy its own demand?

  • That would be the point of spinning it out. They could have an IPO, raise as much capital as there is in the observable Universe, and build enough fabs to satisfy all the demand.

    • That wouldn't work. Even TPUv4 was on a 7nm node, and you don't build a 7nm fab just like that. If it were that easy, Nvidia would already be building its own fabs -- they have basically raised as much capital as there is in the known universe (a bigger market cap than the entire London Stock Exchange) -- but they seem to prefer to let the fab experts get on with it rather than compete with them.

      LLM AI is largely HBM-bottlenecked anyway, i.e. Samsung, SK Hynix, and Micron are where the supply-chain limits enter the picture.


    • There seems to be this idea that the people who design and operate fabs exist in infinite supply, when it's actually a technically demanding job.

      We don't even have enough McDonald's employees; how the hell are we going to suddenly have multiple companies building fabs left and right? TSMC cannot even build its Arizona plant without running into a shortage of workers.


    • Intel has been trying to build cutting-edge fabs... and we all know how that is going.

      There is a good reason nobody wants to be in the fab business.

> It's crazy that Google doesn't spin out their TPU work as a separate company.

Not really. Google's TPUs require Google-specific infrastructure and cannot be deployed outside Google's datacenters. The software is Google-specific; the monetization model is Google-specific.

We also have no idea how profitable TPUs would actually be as a separate company. The only customers for TPUs are Google itself and Google Cloud.