Comment by tristor

3 months ago

> What would you build if on-device AI were genuinely as fast as cloud?

I think this has to be the future for AI tools to really be truly useful. The things that are truly powerful are not general purpose models that have to run in the cloud, but specialized models that can run locally and on constrained hardware, so they can be embedded.

I'd love to see this able to be added in-path as an audio passthrough device so you can add on-device native transcriptioning into any application that does audio, such as in video conferencing applications.

This is a great idea. A virtual audio device that sits in the path of any audio stream and provides live transcription, that would be huge for video conferencing, lectures, podcasts.

MetalRT's STT numbers make this feasible: 70 seconds of audio transcribed in 101ms means you could process audio chunks in real-time with massive headroom. The latency would be imperceptible.

We haven't built this yet but it's a compelling use case. CoreAudio supports virtual audio devices (aggregate devices) that could pipe audio through the pipeline. If anyone in this thread has experience building macOS audio HAL plugins and wants to collaborate, we're very open to contributions, RCLI is MIT.

  • Something that could be possible is serving the model as a virtual audio device and then you can use existing tools on macOS like Rogue Amoeba's Loopback to direct audio to split to that virtual device and your other output (you'd configure your Loopback device as the output in your system audio settings).

    I have never written audio drivers on macOS, but maybe something worth exploring to see if I can make this happen. I really appreciate high quality AI transcripts in my meetings, but right now only Webex has good transcriptioning, and a lot of meetings use other services like MS Teams, Zoom, Meet, et al.