Comment by indigodaddy

1 month ago

Simon, how do you think this would perform on CPU only? Let's say a Threadripper with 20 GB of RAM (voice cloning in particular).

2 comments

simonw 1 month ago

No idea at all, but my guess is it would work, just a bit slowly.

You'd need to use a different build of the model, though. I don't think MLX has a CPU implementation.

genewitch 1 month ago

The old voice cloning and/or TTS models were CPU-only. They weren't real time, but they were no worse than 2:1: 30 seconds of audio would take roughly 60 seconds to generate. By 2021, one-shot TTS/cloning on GPUs was getting close enough to real time that, if you were willing to deal with the setup, you could wire microphone audio to the model, speak, and have the model modify your voice in real time. Phil Hendrie is jealous.

Anyhow, with faster CPUs and optimizations, you won't be waiting too long. Also, 20 GB is overkill for an audio model. Only text models (LLMs) are huge and take seemingly unlimited memory. SD/FLUX models use under 16 GB of RAM (uh, mine do, at least!), for instance.
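
For anyone who wants to put a number on the 2:1 figure above, here is a minimal sketch of the arithmetic. The function names `real_time_factor` and `estimated_wait` are made up for illustration and do not come from any TTS library; the only data used is the 30 seconds of audio in 60 seconds example from this thread, not a benchmark.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Generation time divided by audio duration; 1.0 means real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimated_wait(audio_seconds: float, rtf: float) -> float:
    """Seconds of waiting for a clip of the given length at a given factor."""
    return audio_seconds * rtf


# The example from the thread: 30 seconds of audio took about 60 seconds.
rtf = real_time_factor(60.0, 30.0)
print(f"real-time factor: {rtf}")

# Hypothetical follow-up: a 5 minute clip at the same factor.
print(f"wait for 300 s of audio: {estimated_wait(300.0, rtf)} s")
```

To check your own CPU, time one generation, divide by the length of the audio it produced, and compare: anything at or below 1.0 keeps up with playback.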