Comment by janalsncm

9 months ago

> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file

Transcribe it locally using whisper and output tokens/sec?
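A rough sketch of the tokens/sec idea, assuming the transcriber returns Whisper-style segments (dicts with `start`, `end`, and `tokens`). The helper name `tokens_per_second` is made up for illustration, and the commented usage assumes the `openai-whisper` package and a hypothetical file `talk.mp3`:

```python
def tokens_per_second(segments):
    """Speaking rate from Whisper-style segments.

    Each segment is assumed to be a dict with 'start' and 'end'
    (seconds) and 'tokens' (a list of token ids).
    """
    total_tokens = 0
    spoken_seconds = 0.0
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if duration <= 0:
            continue  # skip empty or malformed segments
        total_tokens += len(seg["tokens"])
        spoken_seconds += duration
    return total_tokens / spoken_seconds if spoken_seconds > 0 else 0.0


# Hypothetical usage (not run here, needs the openai-whisper package):
#   import whisper
#   result = whisper.load_model("base").transcribe("talk.mp3")
#   print(tokens_per_second(result["segments"]))
```

Dividing by the summed segment durations instead of the file length keeps long gaps between segments from dragging the rate down, though pauses inside a segment still count.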

Just count syllables per second by doing an FFT plus some basic analysis.

  • > FFT plus some basic analysis

    Yeah, totally easier than `len(transcribe(a))/len(a)`

    • Maybe not as quick to code up but way faster to calculate.

      The tokens/second can be used as ground truth labels for an FFT -> small neural net model.
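For what it's worth, "FFT plus some basic analysis" could look roughly like the sketch below. This is one guess at what was meant, not a tested syllable detector: take a short-time FFT, sum the energy in a rough speech band per frame, and count peaks in that envelope. The 300-3000 Hz band, the 256-sample frame, and the 0.3 relative threshold are all assumptions, and the function names are made up.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def band_energy_envelope(samples, sample_rate, frame=256,
                         lo_hz=300.0, hi_hz=3000.0):
    """Energy in [lo_hz, hi_hz] for each non-overlapping frame."""
    lo_bin = int(lo_hz * frame / sample_rate)
    hi_bin = min(int(hi_hz * frame / sample_rate), frame // 2)
    env = []
    for start in range(0, len(samples) - frame + 1, frame):
        spectrum = fft(samples[start:start + frame])
        env.append(sum(abs(c) ** 2 for c in spectrum[lo_bin:hi_bin + 1]))
    return env


def syllables_per_second(samples, sample_rate, frame=256, rel_threshold=0.3):
    """Count envelope peaks above a relative threshold, per second of audio."""
    env = band_energy_envelope(samples, sample_rate, frame)
    if len(env) < 3:
        return 0.0
    floor = rel_threshold * max(env)
    peaks = sum(
        1 for i in range(1, len(env) - 1)
        if env[i] > floor and env[i] > env[i - 1] and env[i] >= env[i + 1]
    )
    return peaks / (len(samples) / sample_rate)


# Synthetic check: a 1 kHz tone pulsed 4 times per second for 2 seconds.
sr = 8000
tone = [
    0.5 * (1 - math.cos(2 * math.pi * 4 * n / sr))
    * math.sin(2 * math.pi * 1000 * n / sr)
    for n in range(2 * sr)
]
print(syllables_per_second(tone, sr))
```

Real speech would need envelope smoothing and a minimum gap between peaks before this is usable. That is also where the tokens/second labels could help, by tuning or replacing the hand-picked threshold.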