Comment by gugagore

6 years ago

I didn't quite catch that...

It's intentionally made hard to listen to. OP spliced multiple copies together with a slight offset, varied the rate while recording, and picked a passage that repeats very similar words in close proximity, which makes it even harder.

> Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using additive synthesis and an acoustic model (physical modelling synthesis). Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.
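
To make the quoted description concrete, here is a rough sketch of the idea in Python: harmonics of a fundamental frequency are summed (additive synthesis), weighted by a crude spectral envelope with peaks at the formants, then mixed with noise according to a voicing parameter. This is a toy illustration and not how eSpeak does it; the function names, the resonance shape, and the formant values in the usage example are all assumptions I made up for the example.

```python
import math
import random


def formant_gain(freq_hz, formants):
    """Toy spectral envelope: one bell-shaped peak per (center_hz, bandwidth_hz) pair.

    The peak shape is an assumption chosen for simplicity, not a real vocal tract model.
    """
    return sum(1.0 / (1.0 + ((freq_hz - fc) / (bw / 2.0)) ** 2) for fc, bw in formants)


def synth_frame(f0, formants, voicing, n_samples, sample_rate=16000, seed=0):
    """Synthesize one frame of samples in [-1, 1].

    f0       -- fundamental frequency in Hz
    formants -- list of (center_hz, bandwidth_hz) pairs shaping the harmonics
    voicing  -- 1.0 = fully voiced (harmonics only), 0.0 = noise only
    """
    rng = random.Random(seed)
    # Use every harmonic of f0 up to the Nyquist frequency.
    n_harmonics = int((sample_rate / 2) // f0)
    harmonics = [(k * f0, formant_gain(k * f0, formants)) for k in range(1, n_harmonics + 1)]
    # Normalize so the summed harmonics can never exceed an amplitude of 1.
    norm = sum(gain for _, gain in harmonics) or 1.0

    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        voiced = sum(gain * math.sin(2 * math.pi * freq * t) for freq, gain in harmonics) / norm
        noise = rng.uniform(-1.0, 1.0)
        samples.append(voicing * voiced + (1.0 - voicing) * noise)
    return samples


# Usage: 10 ms of a mostly voiced, vaguely "ah"-like sound at 120 Hz.
# The formant values are rough illustrative numbers, not measured data.
AH_FORMANTS = [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)]
frame = synth_frame(120.0, AH_FORMANTS, voicing=0.9, n_samples=160)
```

Varying `f0`, `formants`, and `voicing` from frame to frame is the "parameters are varied over time" part. A real synthesizer would also keep phase continuous across frames, which this sketch ignores.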

  • I thought that the argument being made is:

    By doing formant synthesis, we can make speech sounds maximally contrastive. Part of what makes natural speech sound the way it does is that the physical and linguistic processes that generate it cause assimilation: the stretch of time during which a phoneme sounds also carries information about neighboring sounds.

    So I was expecting something easy to listen to.

  • Well, either that or eSpeak has badly broken something with their rate boost, because wow, that's pretty terrible.

    • Standard espeak at 1000 wpm, with the settings I posted in the other comment.

      eSpeak NG text-to-speech: 1.49.2