Comment by hosaka
14 hours ago
Depending on the TTS model being used, latency can be reduced further with an LRU cache, fetching common phrases from the cache instead of generating them fresh with TTS.
However, how natural it sounds will depend on how the TTS model works and whether two identical chunks of text sound alike on every generation.
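For instance, a minimal sketch of that idea, keying the cache on the exact phrase text (the `synthesize` function here is a stand-in for whatever TTS call is actually in use, not a real API):

```python
from collections import OrderedDict


def synthesize(text: str) -> bytes:
    # Stand-in for the real TTS call; returns fake "audio" bytes.
    return f"audio<{text}>".encode()


class TTSCache:
    """Tiny LRU cache for synthesized phrases, keyed on exact text."""

    def __init__(self, maxsize: int = 128):
        self.maxsize = maxsize
        self._store: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_audio(self, text: str) -> bytes:
        if text in self._store:
            self._store.move_to_end(text)  # mark as most recently used
            self.hits += 1
            return self._store[text]
        self.misses += 1
        audio = synthesize(text)  # cache miss: generate fresh
        self._store[text] = audio
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict least recently used
        return audio


cache = TTSCache(maxsize=2)
cache.get_audio("hello")  # miss: synthesized and cached
cache.get_audio("hello")  # hit: served from cache, no TTS call
```

One caveat this makes concrete: a hit always replays the same bytes, so if the underlying model produces varied prosody per generation, cached phrases will sound noticeably more repetitive than fresh ones.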
Yep. Seems like caching more broadly is something worth exploring next if I were to do a pt2.