Comment by thesz
1 day ago
5 days ago: https://news.ycombinator.com/item?id=45926371
Sparse models give the same quality of results but have fewer coefficients to process; in the case described in the link above, sixteen (16) times fewer.
This means these models need 8 times less data to store, can be 16 or more times faster, and use 16+ times less energy.
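A minimal sketch of where those numbers come from (my own illustration, not from the linked post): in a CSR sparse matrix the matvec work scales with the number of nonzeros, while storage pays an index per nonzero on top of the value, which may be why the storage saving (8x) is about half the coefficient saving (16x).

    # Sketch: compute and storage in a sparse matvec scale with nnz.
    # The 1/16 density is the assumed sparsity from the linked case.
    import numpy as np
    from scipy.sparse import random as sparse_random

    d = 4096
    W_sparse = sparse_random(d, d, density=1.0 / 16.0,
                             format="csr", dtype=np.float32)
    x = np.random.randn(d).astype(np.float32)

    y = W_sparse @ x                     # work ~ nnz = d*d/16, not d*d
    dense_coeffs = d * d
    sparse_coeffs = W_sparse.nnz
    print(dense_coeffs / sparse_coeffs)  # ~16x fewer coefficients to process
    # CSR stores a value plus a column index per nonzero, so the
    # storage saving is roughly half the coefficient saving.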
TPUs are not all that good with sparse matrices. They can be used to train dense versions, but inference efficiency with sparse matrices may not be all that great.
TPUs do include dedicated hardware, SparseCores, for sparse operations.
https://docs.cloud.google.com/tpu/docs/system-architecture-t...
https://openxla.org/xla/sparsecore
SparseCores appear to be block-sparse rather than element-sparse: they operate on 8- and 16-wide vectors.
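A rough illustration of why that distinction matters (my own sketch, assuming an unstructured 1/16 density and 16-wide blocks): hardware that skips only all-zero blocks still does full work on any block containing a single nonzero, so element-level sparsity recovers far less than the nominal density suggests.

    # Sketch: effective work for block-sparse hardware on an
    # unstructured (element-sparse) weight mask.
    import numpy as np

    rng = np.random.default_rng(0)
    d, density, block = 4096, 1.0 / 16.0, 16

    W = (rng.random((d, d)) < density).astype(np.float32)  # element-sparse mask
    blocks = W.reshape(d, d // block, block)                # 16-wide lanes
    nonzero_blocks = np.any(blocks != 0, axis=-1)           # block runs if any lane is set

    element_work = W.mean()                                 # ~0.0625
    block_work = nonzero_blocks.mean()                      # ~0.64
    print(element_work, block_work)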
Here's another inference-efficient architecture where TPUs are useless: https://arxiv.org/pdf/2210.08277
There is no matrix-vector multiplication; parameters are estimated using Gumbel-Softmax, so TPUs are of no use here.
Inference is done bit-wise, and the most efficient inference is obtained after applying Boolean logic simplification algorithms (ABC or mockturtle).
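To make the contrast concrete, a generic sketch of the two ingredients as I read them (my own illustration, not code from the paper): Gumbel-Softmax makes a discrete choice differentiable during training, and at inference the choice is frozen so evaluation is just indexing a truth table with packed input bits, with no multiply-accumulates for a TPU to accelerate.

    # Sketch: Gumbel-Softmax relaxation (training) vs. bit-wise
    # lookup (inference). The 4-input lookup unit is hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)

    def gumbel_softmax(logits, tau=1.0):
        # Soft, differentiable stand-in for sampling an argmax.
        g = -np.log(-np.log(rng.random(logits.shape)))
        z = (logits + g) / tau
        z -= z.max()                      # numerical stability
        e = np.exp(z)
        return e / e.sum()

    # Training time: soft distribution over the 16 entries of a 4-input LUT.
    soft_choice = gumbel_softmax(rng.normal(size=16))

    # Inference time: the table is frozen to bits; evaluation is indexing only.
    table = rng.integers(0, 2, size=16, dtype=np.uint8)   # stand-in learned table
    x_bits = np.array([1, 0, 1, 1], dtype=np.uint8)
    index = int((x_bits << np.arange(4)).sum())            # pack input bits
    print(table[index])                                    # one output bit, no MACs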
In my (not so) humble opinion, TPUs are an example of premature optimization.
They are on their 7th generation now, so presumably the architecture is being updated as needs require.