Comment by Mizza

3 days ago

Demos here: https://resemble-ai.github.io/chatterbox_demopage/ (not mine)

This is a good release if they're not too cherry picked!

I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.

FWIW, in my recent experience I've found LLMs are very good at reading through transcription errors.

(I've yet to experiment with giving the LLM alternate transcriptions or confidence levels, but I bet they could make good use of that too)

  • Pairing speech recognition with a LLM acting as a post-processor is a pretty good approach.

    I put together a script a while back that takes any audio file (wav, mp3, etc.), normalizes the audio, passes it to ggerganov's whisper.cpp for transcription, and then forwards the text to an LLM for cleanup. I've used it with a pretty high rate of success on some very old and poorly recorded voice dictation recordings of mine from over a decade ago.

    Public gist in case anyone finds it useful:

    https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
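The normalize → transcribe → clean-up pipeline described above can be sketched roughly as follows. This is not the linked gist; the ffmpeg flags, the whisper.cpp binary and model paths, and the cleanup prompt are all illustrative assumptions:

```python
def normalize_cmd(src, dst):
    # Loudness-normalize and resample to 16 kHz mono WAV,
    # which whisper.cpp expects (assumes ffmpeg is on PATH).
    return ["ffmpeg", "-y", "-i", src,
            "-af", "loudnorm", "-ar", "16000", "-ac", "1", dst]

def transcribe_cmd(wav, model="models/ggml-base.en.bin"):
    # whisper.cpp CLI; -otxt writes the transcript next to the wav.
    return ["./main", "-m", model, "-f", wav, "-otxt"]

# Prompt sent to the LLM in the final step (wording is a made-up example).
CLEANUP_PROMPT = (
    "The following is an automatic speech transcription that may contain "
    "errors. Fix obvious mistranscriptions, punctuation, and casing, but "
    "do not add or remove content:\n\n{transcript}"
)

def build_pipeline(audio_file):
    # Returns the shell commands to run, plus the intermediate wav path.
    wav = audio_file.rsplit(".", 1)[0] + ".norm.wav"
    return [normalize_cmd(audio_file, wav), transcribe_cmd(wav)], wav
```

Each command list can be handed to `subprocess.run`, and the resulting `.txt` file interpolated into `CLEANUP_PROMPT` for whatever LLM API you use.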

    • An LLM step also works pretty well for diarization. You get a transcript with speaker segmentation (with whisper and pyannote, for example): SPEAKER_01 says at some point "Hi, I'm Bob. And here's Alice", SPEAKER_02 says "Hi Bob", and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
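A minimal sketch of that speaker-naming step. The merge of whisper segments with pyannote turns is not shown, and the prompt wording is an assumption; only the rendering and label-renaming pieces are illustrated:

```python
def render_transcript(segments):
    # segments: (speaker_label, text) pairs produced by merging whisper
    # output with pyannote diarization turns (merge step not shown).
    return "\n".join(f"{spk}: {text}" for spk, text in segments)

# Hypothetical prompt asking the LLM to resolve anonymous labels to names.
NAMING_PROMPT = (
    "Below is a diarized transcript with anonymous speaker labels. "
    "Infer real names from context (introductions, greetings) and reply "
    "with one 'LABEL = Name' line per speaker:\n\n{transcript}"
)

def apply_names(segments, mapping):
    # mapping: e.g. {"SPEAKER_01": "Bob"}, as parsed from the LLM's reply.
    return [(mapping.get(spk, spk), text) for spk, text in segments]
```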


  • I was going to say, ideally you'd be able to funnel alternates to the LLM, because it would be vastly better equipped than a purely phonetic model to judge what a reasonable next word is.

    • If you just give the LLM the transcript and tell it that it's a voice transcript with possible errors, it actually does a great job in most cases. I mostly have problems with mistranscriptions that say something entirely plausible but not at all what I said. Because the STT engine is trying to produce a semantically valid transcription, it often yields output that is grammatically correct, semantically plausible, and wrong. These really foil the LLM.

      Even if you can just mark the text as suspicious, I think that in an interactive application this would give the LLM enough information to confirm what you were saying when a really critical piece of text is low-confidence. The LLM doesn't just know what the most plausible words and phrases for the user to say are; it can also evaluate whether the overall gist is high or low confidence, and whether the resulting action is high or low risk.
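One hypothetical way to surface that to the LLM is to wrap low-confidence words in markers before prompting. Some STT engines expose word-level probabilities; the threshold and marker syntax below are made up for illustration:

```python
def mark_suspicious(words, threshold=0.6):
    # words: (token, confidence) pairs from the STT engine.
    # Low-confidence tokens get wrapped so the LLM can see where the
    # transcript is shaky and ask for confirmation if the action is risky.
    out = []
    for tok, conf in words:
        out.append(f"[?{tok}?]" if conf < threshold else tok)
    return " ".join(out)
```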

    • This is actually something people used to do.

      Old ASR systems (even models like Wav2vec) were usually combined with a language model. It wasn't a large language model (those didn't exist at the time); it was usually something based on n-grams.
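A toy illustration of that idea: the acoustic model proposes candidate transcriptions with scores, and an n-gram LM rescores them. The counts, smoothing, and weighting here are invented, and real systems used far larger LMs with beam search over lattices rather than a flat candidate list:

```python
import math

# Toy bigram counts standing in for a real n-gram language model.
BIGRAMS = {("recognize", "speech"): 50, ("wreck", "a"): 2,
           ("a", "nice"): 40, ("nice", "beach"): 8}

def lm_logprob(words, counts, vocab=1000):
    # Add-one smoothed bigram log-probability of a word sequence.
    total = sum(counts.values())
    score = 0.0
    for a, b in zip(words, words[1:]):
        score += math.log((counts.get((a, b), 0) + 1) / (total + vocab))
    return score

def rescore(candidates, weight=0.5):
    # candidates: (words, acoustic_logprob) pairs; return the word
    # sequence with the best combined acoustic + LM score.
    return max(candidates,
               key=lambda c: c[1] + weight * lm_logprob(c[0], BIGRAMS))[0]
```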

Right you are. I've used Speechmatics; they do a decent job with transcription.

  • 1 error every 78 characters?

    • The way to measure transcription accuracy is word error rate, not character error rate. I haven't really checked (or trusted) Speechmatics' accuracy benchmarks, but from my experience and personal impression it looks good. I haven't done a quantitative benchmark.
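For reference, word error rate is just word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A minimal sketch:

```python
def wer(ref, hyp):
    # Word error rate via edit distance over words rather than characters.
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(r)][len(h)] / len(r)
```

This is why a "1 error every 78 characters" figure isn't directly comparable to the WER numbers vendors usually quote.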


I played with the Hugging Face demo, and I'm guessing this page is a little cherry-picked? In particular, I am not getting that kind of emotion in my responses.

  • It is hard to get consistent emotion with this. There are some parameters, and you can go a bit crazy, but it gets weird…

I absolutely ADORE that this has swearing directly in the demo. And from Pulp Fiction, too!

> Any of you fucking pricks move and I'll execute every motherfucking last one of you.

I'm so tired of the boring old "miss daisy" demos.

People in the indie TTS community often use the Navy Seal copypasta [1, 2]. It's refreshing to see Resemble using swear words themselves.

They know how this will be used.

[1] https://en.wikipedia.org/wiki/Copypasta

[2] https://knowyourmeme.com/memes/navy-seal-copypasta

  • Heh, I always type out the first sentence or two of the Navy Seal copypasta when trying out keyboards.