Comment by rhdunn

1 month ago

AI in this sense means using Machine Learning (ML)/Neural Networks (NN) to convert the text (or phonemes) to audio.

There are effectively two approaches to voice synthesis: time-domain and pitch-domain (i.e. frequency-domain).

In time-domain synthesis you are concatenating short waveforms together. These techniques are variations of Overlap and Add: OLA [1], PSOLA [2], MBROLA [3], etc.
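As a concrete illustration (mine, not from the linked references), plain OLA just sums pre-windowed waveform grains at overlapping offsets; PSOLA additionally aligns those offsets to the pitch periods so the fundamental frequency can be shifted:

```python
import numpy as np

def overlap_add(grains, hop):
    # Sum pre-windowed grains at fixed hop offsets; overlapping
    # regions crossfade because each grain carries its own window.
    n = hop * (len(grains) - 1) + len(grains[-1])
    out = np.zeros(n)
    for i, g in enumerate(grains):
        out[i * hop : i * hop + len(g)] += g
    return out

# Toy usage: three Hann-windowed 440 Hz grains at 50% overlap.
sr, size = 16000, 512
t = np.arange(size) / sr
grain = np.sin(2 * np.pi * 440 * t) * np.hanning(size)
y = overlap_add([grain] * 3, hop=size // 2)
```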

In pitch-domain synthesis, the analysis and synthesis happen in the frequency domain via the Fast Fourier Transform (visualized as a spectrogram [4]), often warped to the Mel scale [5] to better highlight the pitches and overtones. The TTS synthesizer then generates these pitches and converts them back to the time domain.
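A sketch of that analysis step, assuming librosa for the STFT/mel machinery (the file name and the n_fft/hop_length/n_mels values are placeholders of my choosing, typical of neural TTS front ends):

```python
import numpy as np
import librosa  # assumption: any STFT + mel-filterbank implementation works

# "speech.wav" is a placeholder path, not from the comment.
y, sr = librosa.load("speech.wav", sr=22050)

# STFT magnitudes mapped onto 80 mel bands: the usual acoustic
# representation a neural TTS model predicts and then inverts (via a
# vocoder such as WaveGrad [9]) back to the time domain.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))  # log-compress for modeling

# The mel warping itself: mel(f) = 2595 * log10(1 + f / 700)  [5]
```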

The basic idea is to extract the formants (the resonant frequency bands that shape the fundamental frequency and its overtones) and build models of them; a minimal sketch of this follows the list. Some techniques include:

1. Klatt formant synthesis [6]

2. Linear Predictive Coding (LPC) [7]

3. Hidden Markov Model (HMM) [8]

4. WaveGrad NN/ML [9]
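To make the formant idea concrete, here is a minimal Klatt-style sketch (my own, using scipy; the /a/ formant frequencies and bandwidths are illustrative textbook values, not from the comment): an impulse-train glottal source passed through a cascade of second-order resonators, one per formant:

```python
import numpy as np
from scipy.signal import lfilter

def resonator(x, freq, bw, sr):
    # Klatt-style second-order IIR resonator: one formant peak at
    # `freq` Hz with bandwidth `bw` Hz.
    r = np.exp(-np.pi * bw / sr)
    b1 = 2 * r * np.cos(2 * np.pi * freq / sr)
    b2 = -r * r
    a0 = 1 - b1 - b2  # scale factor giving unity gain at DC
    return lfilter([a0], [1, -b1, -b2], x)

sr, f0, dur = 16000, 120, 0.5
n = int(sr * dur)

# Glottal source: an impulse train at the fundamental frequency f0.
src = np.zeros(n)
src[:: sr // f0] = 1.0

# Cascade three resonators at rough formant values for the vowel /a/.
y = src
for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:
    y = resonator(y, freq, bw, sr)
y /= np.abs(y).max()  # normalize to [-1, 1]
```

The cascade structure is what Klatt-style synthesizers use for vowels; the other techniques in the list differ mainly in how those resonance parameters are estimated and driven over time (LPC fits an all-pole filter to recorded speech, HMMs model the parameter trajectories statistically, and WaveGrad replaces the hand-built model with a learned one).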

[1] https://en.wikipedia.org/wiki/Overlap%E2%80%93add_method

[2] https://en.wikipedia.org/wiki/PSOLA -- Pitch-synchronous Overlap and Add

[3] https://en.wikipedia.org/wiki/MBROLA -- Multi-Band Resynthesis Overlap and Add

[4] https://en.wikipedia.org/wiki/Spectrogram

[5] https://en.wikipedia.org/wiki/Mel_scale

[6] https://en.wikipedia.org/wiki/Dennis_H._Klatt

[7] https://en.wikipedia.org/wiki/Linear_predictive_coding

[8] https://www.cs.cmu.edu/~awb/papers/ssw6/ssw6_294.pdf

[9] https://arxiv.org/abs/2009.00713 -- WaveGrad: Estimating Gradients for Waveform Generation