Comment by umairnadeem123

6 days ago

I buy the thesis that audio is a wedge because latency/streaming constraints are brutal, but I wonder if it's also just that evaluation is easier. With vision, it's hard to say whether a model is "right" without human taste, but with speech you can measure WER, speaker similarity, diarization errors, and stream jitter. Do you think the real moat is infra (real-time) or data (voices / conversational corpora)?
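
For context on why speech evaluation feels more mechanical, here's a minimal sketch of word error rate (WER), one of the metrics I mentioned. It's just word-level Levenshtein distance normalized by reference length; the `wer` function name and example strings are my own, not from any particular toolkit.

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("grey" vs "gray") over four reference words -> 0.25
print(wer("the sky is gray", "the sky is grey"))
```

Point being: a single number you can rank models on, no human raters in the loop, which is rare in vision.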