Comment by pj_mukh
1 year ago
Super cool! Didn't realize OpenAI is just using LiveKit.
Does the pricing break down to be the same as having an OpenAI Advanced Voice socket open the whole time? It's like $9/hr!
It would theoretically be cheaper to skip keeping the Advanced Voice socket open the whole time, use the GPT-4o streaming service [1] only when inference is needed (pay per token), and use LiveKit's other components for the rest (TTS, VAD, etc.).
What's the trade off here?
[1]: https://platform.openai.com/docs/api-reference/streaming
Currently it does: all audio is sent to the model.
However, we are working on turn detection within the framework, so you won't have to send silence to the model when the user isn't talking. It's a fairly straightforward path to cutting the cost by ~50%.
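A minimal sketch of that idea, not how the framework actually implements it: gate each audio frame through a VAD and only forward frames near recent speech. The RMS-energy check here is a hypothetical stand-in for a real VAD score (Silero, pymicro-vad, etc.), and the threshold and hangover values are illustrative.

```python
from dataclasses import dataclass

@dataclass
class SpeechGate:
    """Forward only speech frames to the model; drop silence.

    The RMS-energy test is a toy stand-in for a real VAD score.
    `hangover` keeps the gate open a few frames after speech stops
    so word tails aren't clipped.
    """
    threshold: float = 500.0   # RMS cutoff separating speech from silence
    hangover: int = 5          # frames to keep sending after speech stops
    _open_for: int = 0

    @staticmethod
    def rms(frame: list[int]) -> float:
        return (sum(s * s for s in frame) / len(frame)) ** 0.5

    def should_send(self, frame: list[int]) -> bool:
        if self.rms(frame) >= self.threshold:
            self._open_for = self.hangover  # speech: (re)arm the hangover
            return True
        if self._open_for > 0:              # recent speech: still send
            self._open_for -= 1
            return True
        return False                        # silence: drop the frame

# Toy 16-bit PCM stream: 3 "loud" frames followed by 10 near-silent ones.
loud = [2000, -2000] * 80
quiet = [10, -10] * 80
gate = SpeechGate()
sent = [gate.should_send(f) for f in [loud] * 3 + [quiet] * 10]
print(sum(sent), "of", len(sent), "frames sent")  # 8 of 13
```

In a real pipeline the saving tracks how much of the call is silence; for a typical two-party conversation each side talks roughly half the time, which is where the ~50% figure comes from.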
Working on this for an internal tool - detecting no speech has been a PITA so far. Interested to see how you go with this.
Use the voice activity detector we wrote for Home Assistant. It works very well: https://github.com/rhasspy/pymicro-vad
Currently we are using Silero VAD to detect speech: https://github.com/livekit/agents/blob/main/livekit-plugins/...
It works well for voice activity, though it doesn't always detect end-of-turn correctly (humans often pause mid-sentence to think). We are working on improving this behavior.
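One common mitigation, sketched here as an illustration rather than as what the LiveKit plugin does: don't end the turn on the first silent frame, only after silence has persisted for some minimum duration, so short thinking pauses don't cut the speaker off. The frame size and silence threshold below are made-up numbers.

```python
class EndOfTurnDetector:
    """Fire end-of-turn only after `min_silence_ms` of continuous silence.

    Feed it one VAD verdict per frame (True = speech detected), e.g. the
    per-frame output of Silero VAD.
    """

    def __init__(self, min_silence_ms: int = 700, frame_ms: int = 20):
        self.needed = min_silence_ms // frame_ms  # silent frames required
        self.silent = 0
        self.in_turn = False

    def push(self, is_speech: bool) -> bool:
        """Return True exactly once, when the current turn ends."""
        if is_speech:
            self.in_turn = True
            self.silent = 0
            return False
        self.silent += 1
        if self.in_turn and self.silent == self.needed:
            self.in_turn = False
            return True
        return False

# 20 ms frames; require 100 ms (5 frames) of silence to end a turn.
det = EndOfTurnDetector(min_silence_ms=100, frame_ms=20)
stream = [True] * 5 + [False] * 3 + [True] * 5 + [False] * 6
#        speech       short pause    more speech   real end of turn
ends = [det.push(s) for s in stream]
print(sum(ends))  # 1 -- the 60 ms pause did not end the turn
```

The hard part is picking the silence window: too short and you interrupt people mid-thought, too long and the agent feels laggy, which is why plain VAD alone isn't enough for turn detection.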
Can I currently put a VAD module in the pipeline and only send audio when there is an active conversation? Feel like just that would solve the problem?
You don't get charged per hour with the OpenAI Realtime API, only for tokens from detected speech and responses.