
Comment by sgt

9 hours ago

While on this subject, what's the go-to speech-to-text model (open source or proprietary, doesn't matter) if you have to support a lot of languages really well?

If proprietary/SaaS fits your use case I can recommend Speechmatics. It has a wider range of languages than a lot of the competition: https://speechmatics.com

(Full disclosure: I'm an engineer there)

  • Will it work with, say, someone speaking English with some Hindi mixed in? I'm not from there so I'm not sure how prevalent that is, but I've been told it's quite common to "mix it up" in India, and I probably need to cater for that use case.

    PS if you can share your email I'll pop you an email about Speechmatics. I tried the English version and it's impressive.

I spent a few days on a similar scenario without much success (one person speaks and their speech is then translated, and I want just the original or both).

An API call to GPT-4o works quite well (it basically handles both transcription and diarization), but I wanted a local model.
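
For reference, this is roughly the shape of that call with the official OpenAI Python SDK; the file name and the gpt-4o-transcribe model choice are assumptions, and this endpoint only does plain transcription, so diarization would need extra prompting or a different endpoint:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # send the audio file to the hosted transcription endpoint
    with open("meeting.mp3", "rb") as audio_file:  # hypothetical file name
        transcript = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",  # assumption; "whisper-1" also works here
            file=audio_file,
        )

    print(transcript.text)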

Whisper is really good when one person is speaking. With more people you get repetitions. Qwen and other open multimodal models give subpar results.
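
For the single-speaker case, a minimal local sketch with the openai-whisper package (model size and file name are placeholders):

    import whisper

    # load a local Whisper checkpoint; smaller sizes trade accuracy for speed
    model = whisper.load_model("large-v3")

    # transcribe a single-speaker recording; the language is auto-detected if not given
    result = model.transcribe("speech.mp3")  # hypothetical file name

    print(result["text"])
    for segment in result["segments"]:
        print(f'{segment["start"]:6.1f}s  {segment["text"].strip()}')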

I tried a multipass approach, with the first pass identifying the language and chunking the audio and the next pass doing the actual transcription, but this tended to miss a lot of content.
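
For what it's worth, here is a rough sketch of that kind of multipass pipeline with plain Whisper; the fixed 30-second chunking, file name, and model size are simplifications rather than exactly what I ran:

    import whisper
    from whisper.audio import SAMPLE_RATE  # 16 kHz

    model = whisper.load_model("large-v3")
    audio = whisper.load_audio("dialogue.mp3")  # hypothetical file name

    chunk_len = 30 * SAMPLE_RATE  # naive fixed-size chunks; a real VAD/segmenter would be better
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]

        # pass 1: detect the dominant language of this chunk
        mel = whisper.log_mel_spectrogram(
            whisper.pad_or_trim(chunk), n_mels=model.dims.n_mels
        ).to(model.device)
        _, probs = model.detect_language(mel)
        lang = max(probs, key=probs.get)

        # pass 2: transcribe the chunk with the detected language pinned
        result = model.transcribe(chunk, language=lang)
        print(f"[{start / SAMPLE_RATE:7.1f}s] ({lang}) {result['text'].strip()}")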

I'm going to give canary-1b-v2 a try next weekend. But it looks like, in spite of enormous development in other areas, speech recognition has stalled since Whisper's release (more than 3 years ago already?).
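
If anyone else wants to try it, the model card route goes through NeMo; a heavily hedged sketch (the call signature and return type vary between NeMo releases, so treat this as an assumption rather than a verified recipe):

    # assumes: pip install "nemo_toolkit[asr]"
    from nemo.collections.asr.models import ASRModel

    model = ASRModel.from_pretrained("nvidia/canary-1b-v2")

    # transcribe a short clip; depending on the NeMo version this returns
    # plain strings or Hypothesis objects
    output = model.transcribe(["sample.wav"])  # hypothetical file name
    first = output[0]
    print(first.text if hasattr(first, "text") else first)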