Comment by daemonologist

11 hours ago

It's interesting to me that all AI music sounds slightly sibilant - like someone taped a sheet of paper to the speaker or covered my head in dry leaves. I know no model is perfect but I'd have thought they'd have ironed out this problem by now, given how pervasive it is and how significantly it degrades the end product.

I've noticed this too, and I have a few theories about it. Disclosure: I know a little about audio, and very little about generative audio AI.

First, perhaps the models are trained on relatively low-bitrate encodings. Just as image models sometimes reproduce JPEG artifacts, we could be hearing the known high-frequency loss of low-bitrate audio codecs. Another idea: 'S' and 'T' sounds and similar consonants are relatively broad-spectrum, not unlike white noise, and that kind of sound is known to be difficult for lossy frequency-domain encoding schemes. Perhaps these models work in a similar domain and are subject to similar constraints. Perhaps there's also a trade-off between low-pass filtering and "warbly" artifacts, and we're hearing a middle-ground compromise.
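The white-noise comparison is easy to check numerically: a broadband hiss spreads its energy across nearly every frequency bin, while a pitched tone concentrates it in one, which is roughly what the spectral-flatness measure captures. A minimal sketch (the signals and the flatness helper here are my own illustration, not taken from any codec):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
sr = 16000

# Pure tone: pick an exact FFT-bin frequency so all energy lands in one bin.
t = np.arange(n) / sr
tone = np.sin(2 * np.pi * (sr * 32 / n) * t)

# White-noise burst: a crude stand-in for the broadband hiss of an 'S'.
noise = rng.standard_normal(n)

def spectral_flatness(x):
    """Geometric / arithmetic mean of the power spectrum (1.0 = perfectly flat)."""
    p = np.abs(np.fft.rfft(x)) ** 2 + 1e-12  # epsilon avoids log(0)
    return np.exp(np.mean(np.log(p))) / np.mean(p)

print(spectral_flatness(tone))   # near 0: one dominant bin
print(spectral_flatness(noise))  # far higher: energy spread across all bins
```

A tonal sound can be described cheaply by a few strong bins; flat, noise-like spectra have no such compact description, which is one intuition for why codecs (and possibly waveform models) smear sibilants.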

I don't know how it happens, but when I hear the "AI" sound in music, this is usually one of the first tells.

Agreed. I find that particularly annoying, and I also seem to find that the spatial arrangement (the stereo image) is muted for most instruments, or that the model simply doesn't use stereo as well as a good human musician would.

I suspect it's because these models generate music as a waveform incrementally rather than globally, so they favor smoothly varying sounds over sharp contrasts. If a model generated MIDI data and then used a MIDI synth to render the audio, you wouldn't get that.
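To illustrate the symbolic route: because timing lives in discrete note events rather than in the waveform itself, every onset is rendered sample-exact regardless of how the event list was generated. A toy sketch (the event format and the naive sine synth are invented for this example, not a real MIDI renderer):

```python
import numpy as np

SR = 22050  # sample rate for the toy renderer

def midi_to_hz(note):
    """Standard MIDI note number -> frequency (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

def render(events, length_s):
    """Render (start_s, dur_s, midi_note) events with a naive sine synth.

    The synthesizer, not the generator, is responsible for the waveform,
    so attacks stay crisp: each note starts at an exact sample index.
    """
    out = np.zeros(int(length_s * SR))
    for start, dur, note in events:
        n = int(dur * SR)
        t = np.arange(n) / SR
        env = np.minimum(1.0, t / 0.005) * np.exp(-3 * t)  # 5 ms attack, decay
        i = int(start * SR)
        out[i:i + n] += env * np.sin(2 * np.pi * midi_to_hz(note) * t)
    return out

# C major arpeggio as symbolic events, rendered to audio in one pass.
audio = render([(0.0, 0.5, 60), (0.5, 0.5, 64), (1.0, 0.5, 67)], 1.5)
```

A waveform model has to "draw" those sharp attacks sample by sample; a symbolic model only has to emit the event, and the synth guarantees the transient.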