Comment by dan-robertson
7 days ago
I think it’s some combination of:
- the models are not too big for the cards. Specifically, providers know exactly which GPUs they have and shape the model's topology (layer sizes, sharding) to fit that hardware well
- lots of optimisations. E.g. the most naive implementation of transformer inference recomputes attention over the entire sequence for every generated token, so per-token cost grows quadratically with output length; actual implementations cache keys and values so it only grows linearly (see the sketch after this list). Then there are lots of small things: tracing the specific model running on the specific GPU, optimising kernels, etc.
- more costs are amortized. Your hardware is effectively expensive because it mostly sits idle. AI company hardware gets much more utilization, so it can be relatively more expensive per card and still cost less per token served, with customers mostly paying for energy (see the rough numbers below).
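
To make the second point concrete, here is a minimal sketch of the KV-cache idea: plain NumPy, toy single-head attention, with all names, shapes, and weights being illustrative assumptions rather than any real library's implementation.

```python
# Sketch only: toy single-head decoder attention, comparing naive
# recomputation with a KV cache. All names/shapes are assumptions.
import numpy as np

d = 64  # head dimension (hypothetical)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    # Attention for a single query against all keys/values seen so far.
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def naive_step(tokens):
    # Naive decoding: re-project K/V for the *whole* sequence every step,
    # so per-token work grows with the square of the sequence length.
    K, V = tokens @ Wk, tokens @ Wv
    return attend(tokens[-1] @ Wq, K, V)

class KVCache:
    # Cached decoding: project only the newest token and append,
    # so per-token work grows linearly with the sequence length.
    def __init__(self):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, new_token):
        self.K = np.vstack([self.K, new_token @ Wk])
        self.V = np.vstack([self.V, new_token @ Wv])
        return attend(new_token @ Wq, self.K, self.V)

# Both paths produce the same attention output for the latest token.
tokens = rng.standard_normal((10, d))
cache = KVCache()
for t in range(len(tokens)):
    cached_out = cache.step(tokens[t])
assert np.allclose(naive_step(tokens), cached_out)
```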
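
And a back-of-envelope sketch of the utilization point. Every number here is an illustrative assumption, not real pricing; the shape of the arithmetic is what matters.

```python
# Sketch only: amortize purchase price over hours the card actually works,
# then add energy. All figures are made-up illustrative assumptions.
hours_per_year = 24 * 365

def cost_per_busy_hour(hardware_cost, lifetime_years, utilization,
                       power_kw, price_per_kwh):
    busy_hours = hours_per_year * lifetime_years * utilization
    return hardware_cost / busy_hours + power_kw * price_per_kwh

# Hobbyist card: cheap, but busy ~2% of the time.
print(cost_per_busy_hour(2_000, 3, 0.02, 0.35, 0.30))   # ~ $3.9 per busy hour
# Datacenter card: 15x the price, but busy ~70% of the time.
print(cost_per_busy_hour(30_000, 3, 0.70, 0.70, 0.15))  # ~ $1.7 per busy hour
```

With the (made-up) numbers above, the card that costs 15x as much still works out cheaper per hour of actual work, and energy becomes a larger share of that cost.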