Comment by TheAceOfHearts

1 day ago

Interesting model. I've managed to get the 0.6B param model running on my old 1080, and I can generate 200-character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically in quality: sometimes the speaker is clear and coherent, but other times it bursts out laughing or moaning. In a way it feels a bit like magical roulette, never being quite certain of what you're going to get. It does have a bit of charm: when you chain the various snippets together, you really don't know what direction it's gonna go.
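For anyone wanting to reproduce the 200-character chunking, here is a minimal sketch of one way to do it. The comment doesn't say how the text was actually split, so the helper names `chunk_text` and `split_long` and the sentence-boundary heuristic are assumptions for illustration: whole sentences are packed greedily, and an overlong sentence falls back to word-level pieces.

```python
import re


def split_long(sentence: str, max_chars: int) -> list[str]:
    """Break one overlong sentence into word-aligned pieces (hypothetical helper)."""
    pieces: list[str] = []
    current = ""
    for word in sentence.split():
        # Hard-cut any single word that is itself longer than the limit.
        while len(word) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Assumed heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = split_long(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent to the TTS model separately and the resulting audio concatenated. Cutting at sentence boundaries rather than at a fixed offset at least keeps each snippet a complete thought, though it won't fix the emotion lottery by itself.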

Speaker Ryan seems to be the most consistent. I tried speaker Eric and it sounded like someone putting on a fake exaggerated Chinese accent to mock speakers.

If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.

Have you tried specifying the emotion? There's an option to do so and if it's left empty it wouldn't surprise me if it defaulted to rng instead of bland.

  • For the system prompt I used:

    > Read this in a calm, clear, and wise audiobook tone.

    > Do not rush. Allow the meaning to sink in.

    But maybe I should experiment with something more detailed. Do you have any suggestions?

    • Something like this:

      Character Name: Marcus Cole

      Voice Profile: A bright, agile male voice with a natural upward lift, delivering lines at a brisk, energetic pace. Pitch leans high with spark, volume projects clearly, near-shouting at peaks, to convey urgency and excitement. Speech flows seamlessly, fluently, each word sharply defined, riding a current of dynamic rhythm.

      Background: Longtime broadcast booth announcer for national television, specializing in live interstitials and public engagement spots. His voice bridges segments, rallies action, and keeps momentum alive, from voter drives to entertainment news.

      Presence: Late 50s, neatly groomed, dressed in a crisp shirt under studio lights. Moves with practiced ease, eyes locked on the script, energy coiled and ready.

      Personality: Energetic, precise, inherently engaging. He doesn’t just read, he propels. Behind the speed is intent: to inform fast, to move people to act. Whether it’s “text VOTE to 5703” or a star-studded tease, he makes it feel immediate, vital.

Do you have the RTF (real-time factor) for the 1080? I'm trying to figure out if the 0.6B model is viable for real-time inference on edge devices.

  • Yeah, it's not great. I wrote a harness that calculates it as: 3.61s Load Time, 38.78s Gen Time, 18.38s Audio Len, RTF 2.111.

    The Tao Te Ching audiobook came in at 62 mins of audio and took 102 mins to generate, which gives an RTF of 1.645.

    I do get a warning about flash-attn not being installed, which says that it'll slow down inference. I'm not sure if that feature can be supported on the 1080 and I wasn't up for tinkering to try.
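For reference, the RTF numbers above can be checked with a few lines. This is a sketch that assumes the usual definition (generation time divided by the length of the audio produced, with model load time left out); `real_time_factor` is an illustrative name, not the harness mentioned in the comment.

```python
def real_time_factor(gen_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio generated.

    Values above 1.0 mean generation is slower than playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return gen_seconds / audio_seconds


# Single snippet: 38.78s generation time for 18.38s of audio (load time excluded).
snippet_rtf = real_time_factor(38.78, 18.38)

# Whole audiobook: 102 minutes of generation for 62 minutes of audio.
audiobook_rtf = real_time_factor(102 * 60, 62 * 60)

print(f"snippet RTF:   {snippet_rtf:.3f}")
print(f"audiobook RTF: {audiobook_rtf:.3f}")
```

The audiobook figure comes out at the quoted 1.645. The snippet figure lands at about 2.110, a hair under the quoted 2.111, which presumably reflects the harness dividing unrounded timings. Either way both are well above 1.0, so this setup can't keep up with playback.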

    • An RTF above 1 for just 0.6B parameters suggests the bottleneck isn't the GPU, even on a 1080. The raw compute should be much faster. I'd bet it's mostly CPU overhead or an issue with the serving implementation.

    • You can install flash attention and the like, but if you're on Windows then, afaik, you can't install or run "triton kernels", which apparently make audio models scream. Whisper complains every time I start it, and it is pretty slow, so I just batch hundreds of audio files on a machine in the corner with a 3060 instead. Technically I could batch them on a CPU too, since I don't particularly care when they finish.