Comment by Mizza

3 days ago

Demos here: https://resemble-ai.github.io/chatterbox_demopage/ (not mine)

This is a good release if they're not too cherry picked!

I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.

FWIW, in my recent experience I've found LLMs are very good at reading through transcription errors.

(I've yet to experiment with giving the LLM alternate transcriptions or confidence levels, but I bet they could make good use of that too)

  • Pairing speech recognition with a LLM acting as a post-processor is a pretty good approach.

    I put together a script a while back that takes any audio file (wav, mp3, etc.), normalizes the audio, passes it to ggerganov's whisper.cpp for transcription, and then forwards the text to an LLM for cleanup. I've used it with a pretty high rate of success on some very old and poorly recorded voice dictation recordings of mine from over a decade ago.

    Public gist in case anyone finds it useful:

    https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
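The normalize → transcribe → clean-up pipeline described above can be sketched roughly as follows. This is not the linked gist; the ffmpeg flags, the whisper.cpp binary and model paths, and the cleanup prompt are all illustrative assumptions:

```python
def normalize_cmd(src, dst):
    # Loudness-normalize and resample to 16 kHz mono WAV,
    # which whisper.cpp expects (assumes ffmpeg is on PATH).
    return ["ffmpeg", "-y", "-i", src,
            "-af", "loudnorm", "-ar", "16000", "-ac", "1", dst]

def transcribe_cmd(wav, model="models/ggml-base.en.bin"):
    # whisper.cpp CLI; -otxt writes the transcript next to the wav.
    return ["./main", "-m", model, "-f", wav, "-otxt"]

# Prompt sent to the LLM in the final step (wording is a made-up example).
CLEANUP_PROMPT = (
    "The following is an automatic speech transcription that may contain "
    "errors. Fix obvious mistranscriptions, punctuation, and casing, but "
    "do not add or remove content:\n\n{transcript}"
)

def build_pipeline(audio_file):
    # Returns the shell commands to run, plus the intermediate wav path.
    wav = audio_file.rsplit(".", 1)[0] + ".norm.wav"
    return [normalize_cmd(audio_file, wav), transcribe_cmd(wav)], wav
```

Each command list can be handed to `subprocess.run`, and the resulting `.txt` file interpolated into `CLEANUP_PROMPT` for whatever LLM API you use.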

    • An LLM step also works pretty well for diarization. You get a transcript with speaker segmentation (with whisper and pyannote, for example): SPEAKER_01 says at some point "Hi, I'm Bob. And here's Alice", SPEAKER_02 says "Hi Bob", and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
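A minimal sketch of that speaker-naming step. The merge of whisper segments with pyannote turns is not shown, and the prompt wording is an assumption; only the rendering and label-renaming pieces are illustrated:

```python
def render_transcript(segments):
    # segments: (speaker_label, text) pairs produced by merging whisper
    # output with pyannote diarization turns (merge step not shown).
    return "\n".join(f"{spk}: {text}" for spk, text in segments)

# Hypothetical prompt asking the LLM to resolve anonymous labels to names.
NAMING_PROMPT = (
    "Below is a diarized transcript with anonymous speaker labels. "
    "Infer real names from context (introductions, greetings) and reply "
    "with one 'LABEL = Name' line per speaker:\n\n{transcript}"
)

def apply_names(segments, mapping):
    # mapping: e.g. {"SPEAKER_01": "Bob"}, as parsed from the LLM's reply.
    return [(mapping.get(spk, spk), text) for spk, text in segments]
```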


  • I was going to say, ideally you'd be able to funnel alternates to the LLM, because it would be vastly better equipped than a purely phonetic model to judge what a reasonable next word is.

    • If you just give the LLM the transcript and tell it that it's a voice transcript with possible errors, it actually does a great job in most cases. I mostly have problems with mistranscriptions that say something entirely plausible but not at all what I said. Because the STT engine is trying to produce a semantically valid transcription, it often yields output that is grammatically correct, semantically plausible, and wrong. These really foil the LLM.

      Even if you can just mark the text as suspicious, I think that in an interactive application this would give the LLM enough information to confirm what you were saying when a really critical piece of text is low-confidence. The LLM doesn't just know what the most plausible words and phrases for the user to say are; it can also evaluate whether the overall gist is high or low confidence, and whether the resulting action is high or low risk.
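One hypothetical way to surface that to the LLM is to wrap low-confidence words in markers before prompting. Some STT engines expose word-level probabilities; the threshold and marker syntax below are made up for illustration:

```python
def mark_suspicious(words, threshold=0.6):
    # words: (token, confidence) pairs from the STT engine.
    # Low-confidence tokens get wrapped so the LLM can see where the
    # transcript is shaky and ask for confirmation if the action is risky.
    out = []
    for tok, conf in words:
        out.append(f"[?{tok}?]" if conf < threshold else tok)
    return " ".join(out)
```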

    • This is actually something people used to do.

      Old ASR systems (even models like Wav2vec) were usually combined with a language model. It wasn't a large language model (those didn't exist at the time); it was usually something based on n-grams.
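A toy illustration of that idea: the acoustic model proposes candidate transcriptions with scores, and an n-gram LM rescores them. The counts, smoothing, and weighting here are invented, and real systems used far larger LMs with beam search over lattices rather than a flat candidate list:

```python
import math

# Toy bigram counts standing in for a real n-gram language model.
BIGRAMS = {("recognize", "speech"): 50, ("wreck", "a"): 2,
           ("a", "nice"): 40, ("nice", "beach"): 8}

def lm_logprob(words, counts, vocab=1000):
    # Add-one smoothed bigram log-probability of a word sequence.
    total = sum(counts.values())
    score = 0.0
    for a, b in zip(words, words[1:]):
        score += math.log((counts.get((a, b), 0) + 1) / (total + vocab))
    return score

def rescore(candidates, weight=0.5):
    # candidates: (words, acoustic_logprob) pairs; return the word
    # sequence with the best combined acoustic + LM score.
    return max(candidates,
               key=lambda c: c[1] + weight * lm_logprob(c[0], BIGRAMS))[0]
```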

Right you are. I've used Speechmatics; they do a decent job with transcription.

  • 1 error every 78 characters?

    • The way to measure transcription accuracy is word error rate, not character error rate. I haven't really checked (or trusted) Speechmatics' accuracy benchmarks, but from my experience and personal impression it looks good. I haven't done a quantitative benchmark.
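For reference, word error rate is just word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A minimal sketch:

```python
def wer(ref, hyp):
    # Word error rate via edit distance over words rather than characters.
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(r)][len(h)] / len(r)
```

This is why a "1 error every 78 characters" figure isn't directly comparable to the WER numbers vendors usually quote.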


I played with the Hugging Face demo, and I'm guessing this page is a little cherry-picked? In particular, I am not getting that kind of emotion in my responses.

  • It is hard to get consistent emotion with this. There are some parameters, and you can go a bit crazy, but it gets weird…

I absolutely ADORE that this has swearing directly in the demo. And from Pulp Fiction, too!

> Any of you fucking pricks move and I'll execute every motherfucking last one of you.

I'm so tired of the boring old "miss daisy" demos.

People in the indie TTS community often use the Navy Seal copypasta [1, 2]. It's refreshing to see Resemble using swear words themselves.

They know how this will be used.

[1] https://en.wikipedia.org/wiki/Copypasta

[2] https://knowyourmeme.com/memes/navy-seal-copypasta

  • Heh, I always type out the first sentence or two of the Navy Seal copypasta when trying out keyboards.