← Back to context

Comment by krisoft

7 years ago

Perfect alignment, or any alignment for that matter, is not necessary. Check out adversarial networks. CycleGAN can be trained to do similar feats in image domain without aligned inputs. Shouldn't be hard to adopt it to audio.

I didn't say "impossible", merely "not simple". The minute you bring in a GAN, things are already not simple. Also, I'm not aware of any work on a GAN that works with a network that does conditional sampling (like LPCNet/WaveNet), so it would mean starting from scratch.