Comment by adzm

3 months ago

A great example of this is converting music into images, training on those images to generate new ones, and converting the results back into music. It was surprisingly successful. I think this approach is still used by current music generators.

You are talking about piano roll notation, I think. While it's 2D data, it's not quite the same as actual image data: 2D convolution and pooling operations are useless for music, since the patterns and dependencies are too subtle to be captured by spatial filters.
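For concreteness, here is a minimal sketch of what piano roll data looks like as a 2D array. The note choices and array sizes are made up purely for illustration:

```python
import numpy as np

# A piano roll: rows are MIDI pitches, columns are time steps.
# A cell is 1 while that pitch is sounding. Hypothetical example
# just to show the shape of the data.
n_pitches, n_steps = 128, 16          # full MIDI range, 16 time steps
roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)

# C major triad (MIDI 60, 64, 67) held for the first 8 steps,
# then a lone G (67) for the rest.
for pitch in (60, 64, 67):
    roll[pitch, 0:8] = 1
roll[67, 8:16] = 1

# Unlike a photograph, adjacent rows are pitches a semitone apart,
# so a small spatial filter mixes musically unrelated notes.
print(roll[55:70])                    # view the active region
```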

  • I am talking about spectrograms (a Fourier transform into the frequency domain, plotted over time), which turn a song into a 2D image. Those images are used to train something like Stable Diffusion (some projects actually use Stable Diffusion itself) to generate new spectrograms, which are then converted back into audio. Riffusion used this approach; there is a sketch of the round trip after this thread.

    • If you think about it, a music sheet is just a graph of a Fourier transform: it shows, at any point in time, which frequencies are present (the pitch of each note) and for how long (the note's duration). There is a small sketch of this mapping after the thread.

    • A magnitude spectrogram is lossy (it discards phase information) and is not a one-to-one mapping of the waveform. Riffusion is, afaik, limited to five-second clips. For those, structure and coherence over time aren't important, and the data is strongly spatially correlated: the pixel next to a blue pixel is usually another blue pixel. To the best of my knowledge, no models synthesize whole songs from spectrograms.
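
To make the spectrogram round trip discussed above concrete, here is a minimal sketch using librosa. The synthetic signal and the parameters are placeholders, and this is not Riffusion's exact pipeline, just the basic forward and inverse steps:

```python
import numpy as np
import librosa

# Stand-in for a real 5-second audio clip: a 440 Hz tone.
sr = 22050
t = np.linspace(0, 5.0, 5 * sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Forward: STFT, keep only the magnitude. Dropping the phase is
# exactly what makes the representation lossy.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
magnitude = np.abs(stft)

# An image model would be trained on (a log- or mel-scaled version of)
# this 2D array: frequency on one axis, time on the other.

# Inverse: Griffin-Lim estimates the missing phase iteratively,
# giving an approximate, not exact, reconstruction.
y_rec = librosa.griffinlim(magnitude, hop_length=512, n_iter=32)

# The reconstruction sounds similar but is not sample-identical,
# because the phase had to be estimated.
n = min(len(y), len(y_rec))
print(f"round-trip MSE: {np.mean((y[:n] - y_rec[:n]) ** 2):.6f}")
```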
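And a small sketch of the "score as frequency over time" point: in equal temperament each note maps to a frequency, so a melody is essentially a list of (frequency, start, duration) entries. The melody and tempo below are made-up examples:

```python
# Equal-tempered pitch: each semitone is a factor of 2**(1/12),
# anchored at A4 = MIDI note 69 = 440 Hz.
A4_MIDI, A4_HZ = 69, 440.0

def midi_to_hz(note: int) -> float:
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)

# (MIDI note, duration in beats): C4, D4, E4.
melody = [(60, 1.0), (62, 1.0), (64, 2.0)]

t = 0.0
for note, beats in melody:
    print(f"t={t:.1f} beats: {midi_to_hz(note):7.2f} Hz for {beats} beats")
    t += beats
```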