Comment by Kye

19 hours ago

My understanding is that music generation works more like Stable Diffusion: it generates a waveform as an image, then turns that into an audio file.

They do use diffusion models, but I don't think they would make a detour via images. They can just generate audio directly with audio diffusion rather than image diffusion.