Comment by killerstorm
6 days ago
I'm curious why smallish TTS models have a metallic voice quality.
The pronunciation sounds about right. I thought that was the hard part, and the model does it well. But voice timbre should be simpler to fix? Like, a simple FIR filter might improve it?
We change our tone based on personal style, emotion, context, and other factors. An accurate generator might need to encode all that information in the model. It will be larger than a model that doesn't do all of that.
Probably the "metallicity" is due to a lack of detail and cannot be fixed that easily.
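For anyone wondering what the "simple FIR" idea amounts to, here is a minimal sketch in plain Python. The function name `fir_filter` and the two-tap averaging filter are just illustrative assumptions, not anything taken from a real TTS pipeline. An FIR filter is a fixed weighted sum of recent samples, so it can boost or cut frequencies that are already in the signal but cannot add detail that was never generated.

```python
def fir_filter(signal, taps):
    """Apply an FIR filter: y[n] = sum over k of taps[k] * signal[n - k].

    Samples before the start of the signal are treated as zero.
    """
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


# Hypothetical two-tap averaging filter (a crude low-pass).
taps = [0.5, 0.5]

# An impulse just reproduces the taps: the filter's whole behavior is fixed.
impulse_response = fir_filter([1, 0, 0, 0], taps)

# A signal alternating at the highest representable frequency is cancelled
# after the first sample: the filter removes content, it does not create any.
alternating = fir_filter([1, -1, 1, -1], taps)
```

So a filter like this could tame a harsh band, but if the metallic sound comes from missing fine structure, there is nothing for the filter to bring back.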