Comment by sovok
3 days ago
An LLM step also works pretty well for diarization. You get a transcript with speaker segmentation (from whisper and pyannote, for example). At some point SPEAKER_01 says "Hi, I'm Bob. And here's Alice", SPEAKER_02 says "Hi Bob", and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
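A rough sketch of that naming step, in Python. The helper names (`build_prompt`, `apply_mapping`), the prompt wording, and the JSON reply format are all assumptions for illustration, not part of whisper or pyannote. The actual LLM call is left out, and a hard-coded reply stands in for it.

```python
import json

def build_prompt(segments):
    """Format diarized (label, text) segments into a prompt asking for names."""
    lines = [f"{speaker}: {text}" for speaker, text in segments]
    return (
        "Below is a transcript with anonymous speaker labels.\n"
        "Return a JSON object mapping each label to the speaker's name, "
        "or null if the name cannot be inferred.\n\n" + "\n".join(lines)
    )

def apply_mapping(segments, llm_reply):
    """Swap labels for names from the LLM's JSON reply; keep unresolved labels."""
    mapping = json.loads(llm_reply)
    return [(mapping.get(speaker) or speaker, text) for speaker, text in segments]

# Segments as they might come out of a whisper + pyannote pipeline (made up here).
segments = [
    ("SPEAKER_01", "Hi, I'm Bob. And here's Alice."),
    ("SPEAKER_02", "Hi Bob"),
]

prompt = build_prompt(segments)

# Stand-in for the model's answer; a real setup would send `prompt` to an LLM.
fake_reply = '{"SPEAKER_01": "Bob", "SPEAKER_02": "Alice"}'
named = apply_mapping(segments, fake_reply)
```

Labels the model can't resolve (null or missing from the reply) just stay as SPEAKER_xx, so a partial answer doesn't break the transcript.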
Yep, an agent I built years ago worked very well with this approach, using a whisper-pyannote combo. The fun part is knowing when to end transcription in noisy environments like a coffee shop.