Comment by transcriptase

1 day ago

I’m super curious now: how does padding lead to TTS replies repeatedly ending with what seems to be an actual non-speech sound effect?

If you pad your output with something that doesn't represent silence, then any output whose length doesn't exactly fill the fixed size (i.e. nearly all outputs) will end with whatever sound your padding values represent in the model's embedding space. If "0000" represents "whoosh," then most of your outputs will end in "whoosh."

Here's a non-AI example: If all HN comments had to be some multiple of 50 characters long and comments were padded with the letter "A," then most HN comments would look like the user was screaming at the end. AAAAAAAAAAAAAAAAAA
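To make the mechanism concrete, here is a minimal sketch in Python. The codebook, the token ids, and the helper names (`pad_to_multiple`, `decode`) are all made up for illustration and are not taken from any real TTS model; the only point is that the decoder faithfully plays back whatever the pad value happens to map to.

```python
def pad_to_multiple(items, multiple, pad):
    """Right-pad a sequence so its length is a multiple of `multiple`."""
    remainder = len(items) % multiple
    if remainder == 0:
        return list(items)
    return list(items) + [pad] * (multiple - remainder)


# Hypothetical codebook: token id -> sound. Note that id 0 is NOT silence.
CODEBOOK = {0: "whoosh", 1: "hel", 2: "lo", 3: "<silence>"}


def decode(tokens):
    """Look up each token id in the toy codebook."""
    return [CODEBOOK[t] for t in tokens]


reply = [1, 2]  # "hel" + "lo"

# Padding with id 0 appends the sound that id 0 stands for.
bad = decode(pad_to_multiple(reply, 4, pad=0))
# Padding with the id that actually means silence is harmless.
good = decode(pad_to_multiple(reply, 4, pad=3))

print(bad)   # ['hel', 'lo', 'whoosh', 'whoosh']
print(good)  # ['hel', 'lo', '<silence>', '<silence>']
```

The same helper reproduces the HN analogy: `"".join(pad_to_multiple(list("hi"), 5, "A"))` gives `"hiAAA"`.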

  • Also a decent AI example, as most AI audio uses base64 encoding, where "A" encodes six zero bits, so AAAAAAAAA decodes to a run of zeroes.
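The base64 part is easy to check with Python's standard library. This only demonstrates the encoding itself; it makes no claim about how any particular audio API transports its data.

```python
import base64

# Six zero bytes are 48 zero bits, i.e. eight 6-bit groups, each encoded as "A".
zeros = bytes(6)
encoded = base64.b64encode(zeros).decode("ascii")
print(encoded)  # AAAAAAAA

# Going the other way: twelve "A"s decode to nine zero bytes.
decoded = base64.b64decode("AAAAAAAAAAAA")
print(decoded == bytes(9))  # True
```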

In addition to what Centigonal said, even if the autoencoder was trained only on speech data, an all-zero vector is probably just out of distribution (the decoder has never seen it before) and causes weird sounds. However, given the hallucinations we're seeing, the AE has likely (maybe unintentionally) seen a bunch of non-speech data like music and sound effects too.
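A toy way to see why an all-zero latent need not decode to silence: below, a single linear layer with a bias stands in for the decoder. This is purely an assumption for illustration (real audio decoders are deep and nonlinear, and the weights here are invented), but it shows that a zero input still yields whatever the learned parameters dictate, not zero-valued samples.

```python
def decode_frame(latent, weights, bias):
    """Toy linear 'decoder': one output sample per row of weights, plus a bias."""
    return [
        sum(w * z for w, z in zip(row, latent)) + b
        for row, b in zip(weights, bias)
    ]


# Invented parameters standing in for whatever training produced.
WEIGHTS = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
BIAS = [0.25, -0.5, 0.125]

zero_latent = [0.0, 0.0]
frame = decode_frame(zero_latent, WEIGHTS, BIAS)

# Silence would be all-zero samples; the zero latent gives the bias instead.
print(frame)  # [0.25, -0.5, 0.125]
```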