Comment by unraveller

2 months ago

The GPU saw a 10% improvement over the TPU:

> The TPU is so inefficient at FTs that the researchers did not use the FFT algorithm on sequences < 4096 elements, instead opting for a quadratic-scaling FT implementation using a pre-computed DFT matrix.

> on an Nvidia Quadro P6000 GPU, the FT was responsible for up to 30% of the inference time on the FNet architecture [0]
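
For anyone wondering what that quadratic-scaling fallback looks like in practice, here is a minimal sketch (my illustration, not the researchers' code; the sequence length and names are made up). TPUs are built around large matrix multiplies, so below some length cutoff a matmul against a precomputed DFT matrix can beat the O(n log n) FFT despite its O(n^2) cost:

```python
import jax.numpy as jnp

def dft_matrix(n):
    # Precomputed n x n DFT matrix: W[j, k] = exp(-2*pi*i*j*k / n).
    # Built once, then reused for every transform of length n.
    jk = jnp.outer(jnp.arange(n), jnp.arange(n))
    return jnp.exp(-2j * jnp.pi * jk / n)

n = 512                          # illustrative length, below the 4096 cutoff
x = jnp.linspace(0.0, 1.0, n)    # toy input signal
W = dft_matrix(n)

y_matmul = W @ x                 # quadratic-scaling FT as a single matmul
y_fft = jnp.fft.fft(x)           # standard FFT for comparison

# Same result up to float32 roundoff, very different hardware behavior.
print(jnp.allclose(y_matmul, y_fft, atol=1e-2))
```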

This company [0] claimed in 2021 that they could squash inference time by 40% if Google used their optical (light-based) chips alongside the TPU. Perhaps more, if FFTNet pushes more of the heavy lifting onto the FT.

[0]: https://scribe.rip/optalysys/attention-fourier-transforms-a-...
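
For context on where that 30% figure bites: FNet's mixing step replaces self-attention with two FFTs, one along the hidden dimension and one along the sequence dimension, keeping only the real part. A hedged sketch (the shapes and the `fourier_mixing` name are my assumptions, not the paper's reference code):

```python
import jax.numpy as jnp

def fourier_mixing(x):
    # x: (seq_len, hidden_dim) token embeddings.
    # FFT along the hidden dim, then the sequence dim; discard the
    # imaginary part. This stands in for the attention matrix in FNet.
    return jnp.fft.fft(jnp.fft.fft(x, axis=-1), axis=-2).real

x = jnp.ones((128, 768))   # assumed BERT-base-like shapes
mixed = fourier_mixing(x)  # the step reportedly eating up to 30% of inference
print(mixed.shape)         # (128, 768)
```

This is the piece an optical Fourier chip would offload, which is presumably where the 40% claim comes from.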

I have been entertaining myself lately by thinking about the ways in which some improvements to a design are very, very interesting to people when a task takes 1.2 machines, not worth paying attention to when it takes 6 machines, and suddenly very interesting again when it takes 120 machines. There's that weird saddle point in the middle where I cannot get anyone else interested in my 20% resource improvements. It's just crickets.