Comment by antt

6 years ago

Typical meatbag voice. The future belongs to formant synthesis: I can easily get to 1000 wpm without any loss of coherence, and the sound is fundamentally inhuman.

https://ufile.io/9905a

I didn't quite catch that...

  • It's intentionally made to be hard to listen to. OP spliced multiple copies together with a slight offset, varied the rate while recording, and deliberately chose a passage that repeats very similar words in close proximity, which makes it even harder.

    > Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using additive synthesis and an acoustic model (physical modelling synthesis). Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.

    • I thought that the argument being made is:

      By doing formant synthesis, we can make speech sounds maximally contrastive. Part of what makes natural speech sound the way it does is that the physical and linguistic processes that generate it cause assimilation: the span of time during which a phoneme sounds also carries information about neighboring sounds.

      So I was expecting something easy to listen to.

Espeak's sibilance really does get painful at speeds like this. You could probably further increase your processing rate if you vary inflection a bit.

  • I've played around with the internals a bit, but I've not been able to substantially improve understandability past 1200 wpm no matter what I do.

    The settings for that text snippet were

    $ espeak -p 100 -s 1000 -v male7

    That's really the optimum for me for speeds over 650 wpm.
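
For anyone who wants to compare rates with those settings, here is a rough sketch that only prints the commands (a dry run). The pitch and voice values are copied as-is from the post above; `passage.txt`, the `sample_<rate>.wav` names, the `espeak_cmd` helper, and the list of rates are placeholders I made up for illustration. `-w` (write a WAV file instead of playing) and `-f` (read text from a file) are standard espeak options.

```shell
#!/bin/sh
# Settings quoted in the post above; only the rate (-s, words per minute) varies.
PITCH=100
VOICE=male7

# Hypothetical helper: print the espeak command line for one rate.
# passage.txt and the sample_<rate>.wav output names are placeholders.
espeak_cmd() {
    printf 'espeak -p %s -s %s -v %s -w sample_%s.wav -f passage.txt\n' \
        "$PITCH" "$1" "$VOICE" "$1"
}

# Dry run: one command per rate, printed rather than executed.
for rate in 650 800 1000 1200; do
    espeak_cmd "$rate"
done
```

Piping the output to `sh` would actually render the files, assuming espeak is installed, `passage.txt` exists, and the voice name resolves on your install.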