Comment by dvfjsdhgfv

9 hours ago

I spent a few days on a similar scenario without much success (a scenario where one person speaks and then their speech is translated, and I want just the original, or both).
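For anyone attempting the same thing, the filtering step itself is simple once you have language-tagged segments; the hard part is getting a reliable segmentation. A minimal sketch of the idea, assuming each segment already carries a detected language (the `Segment` type and `keep_original` helper here are hypothetical, not from any particular library):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    lang: str     # ISO 639-1 code from a language-ID pass
    text: str

def keep_original(segments: list[Segment]) -> list[Segment]:
    """Keep only segments in the language the recording starts in,
    dropping the interleaved translation."""
    if not segments:
        return []
    # Assumption: the talk opens in the original language.
    original = segments[0].lang
    return [s for s in segments if s.lang == original]

segs = [
    Segment(0.0, 4.2, "de", "Guten Tag..."),
    Segment(4.2, 8.0, "en", "Good afternoon..."),  # interpreter
    Segment(8.0, 12.5, "de", "Heute sprechen wir..."),
]
print([s.text for s in keep_original(segs)])
# → ['Guten Tag...', 'Heute sprechen wir...']
```

This obviously breaks when both languages are close enough to confuse the language-ID model, which is part of why the whole thing is fragile.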

An API call to GPT-4o works quite well (it handles both transcription and diarization), but I wanted a local model.

Whisper is really good for one person speaking. With more people you get repetitions. Qwen and other open multimodal models give subpar results.

I tried a multipass approach, where the first pass identifies the language and chunks the audio and the next pass does the actual transcription, but this tended to miss a lot of content.
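For reference, the shape of that two-pass pipeline, with both model calls stubbed out (`detect_language_spans` and `transcribe` are stand-ins for whatever language-ID and ASR models you plug in). Most of the content loss seemed to come from chunk boundaries falling mid-utterance, which a bit of overlap padding partly mitigates:

```python
# Pass 1 tags language spans; pass 2 transcribes each span separately.
# Both model calls are stubs here -- replace with real language-ID / ASR.

def detect_language_spans(audio: list[float], sr: int) -> list[tuple[int, int, str]]:
    """Pass 1 (stub): return (start_sample, end_sample, lang) spans."""
    half = len(audio) // 2
    return [(0, half, "de"), (half, len(audio), "en")]

def transcribe(chunk: list[float], sr: int, lang: str) -> str:
    """Pass 2 (stub): run ASR on one chunk in a known language."""
    return f"[{lang}: {len(chunk)} samples]"

def two_pass(audio: list[float], sr: int = 16000, pad: float = 0.2) -> list[str]:
    # Overlap each chunk slightly so words at span boundaries survive.
    pad_n = int(pad * sr)
    out = []
    for start, end, lang in detect_language_spans(audio, sr):
        chunk = audio[max(0, start - pad_n):min(len(audio), end + pad_n)]
        out.append(transcribe(chunk, sr, lang))
    return out

print(two_pass([0.0] * 32000))
```

Even with padding, anything the language-ID pass mislabels or skips never reaches the transcription pass at all, which is consistent with the missing content I saw.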

I'm going to give canary-1b-v2 a try next weekend. But it looks like, in spite of enormous progress in other areas, speech recognition has stalled since Whisper's release (more than three years ago already?).