Comment by Kye
19 hours ago
My understanding is music generation is more like stable diffusion. It generates a waveform as an image, then turns it into an audio file.
They do use diffusion models, but I don't think they take a detour via images: they can diffuse the audio directly with audio diffusion rather than image diffusion.
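To make the "diffuse audio directly" point concrete, here's a minimal sketch (illustrative only, not any particular product's pipeline) of the standard DDPM forward-noising process applied to a raw 1-D waveform. Nothing about diffusion requires a 2-D image; only the tensor shape changes. The noise schedule values below are the common DDPM defaults, chosen here as an assumption for the example:

```python
import numpy as np

# DDPM forward process q(x_t | x_0) = N(sqrt(ab_t) * x0, (1 - ab_t) * I),
# applied to a 1-D audio signal instead of a 2-D image.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_t)

sr = 16000
t = np.arange(sr) / sr
x0 = np.sin(2 * np.pi * 220 * t)       # stand-in for a "clean" audio clip

def noised(x0, step):
    """Sample x_t from q(x_t | x_0) at a given timestep."""
    ab = alpha_bar[step]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * rng.standard_normal(x0.shape)

x_mid = noised(x0, 500)    # partially noised waveform
x_end = noised(x0, T - 1)  # nearly pure Gaussian noise
```

A trained model would then learn to reverse this process step by step, denoising from `x_end` back toward a waveform; the reverse network is omitted here since it's the part that requires training.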
There technically was one early experiment that tricked Stable Diffusion into generating spectrograms which could then be converted into audio, and it worked surprisingly well:
https://web.archive.org/web/20230314190913/https://www.riffu...
https://huggingface.co/riffusion/riffusion-model-v1
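The conversion step that makes the spectrogram trick work is phase reconstruction: an image only stores magnitudes, so the phase has to be estimated before you can invert back to a waveform. Riffusion used Griffin-Lim for this; below is a minimal numpy sketch of that algorithm (a self-contained illustration, not Riffusion's actual code, and the FFT/hop sizes are arbitrary choices for the demo):

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform with a Hann window."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(S, n_fft=512, hop=128):
    """Inverse STFT via windowed overlap-add."""
    win = np.hanning(n_fft)
    n_frames = S.shape[0]
    length = n_fft + (n_frames - 1) * hop
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(S, n=n_fft, axis=1)
    for i in range(n_frames):
        x[i * hop:i * hop + n_fft] += frames[i] * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=50, n_fft=512, hop=128):
    """Recover a waveform from magnitudes only (Griffin & Lim, 1984):
    alternate between time and frequency domains, keeping the target
    magnitudes and updating only the phase estimate each iteration."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)

# Demo: drop the phase of a 440 Hz tone, then reconstruct it.
sr = 8000
audio = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
mag = np.abs(stft(audio))               # this is all a spectrogram image keeps
recon = griffin_lim(mag)
err = np.linalg.norm(np.abs(stft(recon)) - mag) / np.linalg.norm(mag)
```

After the iterations, `recon` has a magnitude spectrogram close to the target (`err` should be small), which is why a model that only paints plausible spectrogram images can still yield listenable audio.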
But I'd expect everything in the past 3 years to diffuse the audio waveform directly.
That's probably what I was thinking of. I haven't kept up as much on non-text generative AI.