Comment by sanchitmonga22

3 months ago

This is a great idea. A virtual audio device that sits in the path of any audio stream and provides live transcription, that would be huge for video conferencing, lectures, podcasts.

MetalRT's STT numbers make this feasible: 70 seconds of audio transcribed in 101ms means you could process audio chunks in real-time with massive headroom. The latency would be imperceptible.

We haven't built this yet but it's a compelling use case. CoreAudio supports virtual audio devices (aggregate devices) that could pipe audio through the pipeline. If anyone in this thread has experience building macOS audio HAL plugins and wants to collaborate, we're very open to contributions, RCLI is MIT.

Something that could be possible is serving the model as a virtual audio device and then you can use existing tools on macOS like Rogue Amoeba's Loopback to direct audio to split to that virtual device and your other output (you'd configure your Loopback device as the output in your system audio settings).

I have never written audio drivers on macOS, but maybe something worth exploring to see if I can make this happen. I really appreciate high quality AI transcripts in my meetings, but right now only Webex has good transcriptioning, and a lot of meetings use other services like MS Teams, Zoom, Meet, et al.