Comment by dsrtslnd23

1 day ago

Do you have the RTF (real-time factor) for the 1080? I'm trying to figure out whether the 0.6B model is viable for real-time inference on edge devices.

Yeah, it's not great. I wrote a harness that measures it; one run came out as: 3.61 s load time, 38.78 s generation time, 18.38 s audio length, RTF 2.111.

The Tao Te Ching audiobook came out to 62 minutes of audio and took 102 minutes to generate, which gives an RTF of 1.645.
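Both numbers above follow from the same formula: generation time divided by audio duration, so anything over 1.0 is slower than real time. A minimal sketch of that calculation (the function name is mine, not from the harness):

```python
def rtf(gen_time_s: float, audio_len_s: float) -> float:
    """Real-time factor: seconds of compute per second of audio produced.
    RTF > 1 means generation is slower than real time."""
    return gen_time_s / audio_len_s

# Numbers from the runs above:
print(round(rtf(38.78, 18.38), 2))        # single clip  -> 2.11
print(round(rtf(102 * 60, 62 * 60), 3))   # audiobook: 102 min gen / 62 min audio -> 1.645
```

The harness reports 2.111 for the clip because it keeps more precision on the raw timings; the ratio is the same either way.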

I do get a warning about flash-attn not being installed, which says it'll slow down inference. I'm not sure whether the 1080 can support it, and I wasn't up for tinkering to find out.

  • An RTF above 1 for just 0.6B parameters suggests the bottleneck isn't the GPU, even on a 1080. The raw compute should be much faster. I'd bet it's mostly CPU overhead or an issue with the serving implementation.

  • You can install flash attention etc., but if you're on Windows, AFAIK you can't install or run Triton kernels, which apparently make audio models scream. Whisper complains every time I start it, and it's pretty slow, so I just batch hundreds of audio files on a machine in the corner with a 3060 instead. Technically I could batch them on a CPU too, since I don't particularly care when they finish.
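If you want to confirm whether those warnings are just missing packages, a quick stdlib check works; `flash_attn` and `triton` are the actual import names for the two packages mentioned, though whether they install at all depends on your OS and GPU generation:

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if a top-level module/package is importable
    in the current environment, without actually importing it."""
    return importlib.util.find_spec(name) is not None

# Both will print False on a machine where the packages didn't install:
print("flash_attn:", has_module("flash_attn"))
print("triton:", has_module("triton"))
```

This only tells you the package is present, not that it works on your GPU; flash-attn can still refuse to run at inference time if the card's architecture isn't supported.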