Comment by yunwal
2 days ago
This is hilariously bad with music. Like I can type in the most basic thing like "string instruments" which should theoretically be super easy to isolate. You can generally one-shot this using spectral analysis libraries. And it just totally fails.
I had the same experience. It did okay at isolating vocals but everything else it failed or half-succeeded at.
Like most models released for publicity rather than usefulness, they'll do great at benchmarks and single specific use cases, but no one seems to be able to release actually generalized models today.
Use Demucs bruh https://github.com/adefossez/demucs
Hilarious that this is maintained by Facebook and yet SAM fails so badly
Like everything AI, you just have to lie a little and people with 0 clue about SOTA in audio will think this is amazing.
what in theory makes those "super easy" to isolate? Humans are terrible at this to begin with, it takes years to train one of them to do it mildly well. Computers are even worse - blind source separation and the cocktail party problem have been the white whale of audio DSP for decades (and only very recently did tools become passable).
The fact that you can do it with spectral analysis libraries, no LLM required.
This is much easier than source separation. It would be different if I were asking to isolate a violin from a viola or another violin; you’d have to get much more specific about the timbre of each instrument and potentially understand what each instrument's part was.
But a vibration made from a string makes a very unique wave that is easy to pick out in a file.
Are you making this up? What spectral analysis libraries or tools?
String instruments create similar harmonic series to horns, winds, and voice (because everything is a string in some dimension) and the major differences are in the spectral envelope, something that STFT tools are just ok at approximating because of the time/frequency tradeoff (aka: the uncertainty principle).
This is a very hard problem "in theory" to me, and I'm just above casually versed in it.
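To make the two points above concrete, here is a minimal sketch, not a real separation tool. It assumes idealized, perfectly steady tones built from pure sinusoids, and the "string-like" and "horn-like" amplitude lists are made-up illustrative values, not measurements of real instruments. The function names are hypothetical. It shows that two tones with the same fundamental have spectral peaks at exactly the same frequencies and differ only in their envelope, and that an STFT window trades frequency resolution against time resolution.

```python
import numpy as np

def stft_resolution(sample_rate, window_size):
    """Return (frequency resolution in Hz, time resolution in s) of one STFT window."""
    freq_res = sample_rate / window_size   # spacing between frequency bins
    time_res = window_size / sample_rate   # duration covered by one window
    return freq_res, time_res

def harmonic_tone(f0, amplitudes, sample_rate=8000, duration=1.0):
    """Sum of harmonics of f0; amplitudes[k] scales harmonic number k + 1."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(amplitudes))

def peak_frequencies(signal, sample_rate, n_peaks):
    """Frequencies (Hz) of the n_peaks largest FFT magnitude bins, ascending."""
    spectrum = np.abs(np.fft.rfft(signal))
    top_bins = np.argsort(spectrum)[-n_peaks:]
    freqs = top_bins * sample_rate / len(signal)
    return sorted(freqs.tolist())

# Assumed, illustrative spectral envelopes (not measured data).
string_like = harmonic_tone(220, [1.0, 0.5, 0.33, 0.25])
horn_like = harmonic_tone(220, [0.4, 1.0, 0.8, 0.3])

string_peaks = peak_frequencies(string_like, 8000, 4)
horn_peaks = peak_frequencies(horn_like, 8000, 4)

# Same harmonic positions for both tones; only the relative heights differ.
print(string_peaks, horn_peaks)

# Short window: coarse in frequency, fine in time. Long window: the reverse.
short_window = stft_resolution(44100, 256)
long_window = stft_resolution(44100, 4096)
print(short_window, long_window)
```

In this idealized setup, peak positions alone cannot tell the two tones apart, which is why the envelope (and how well the chosen window resolves it) ends up carrying the distinguishing information.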
>what in theory makes those "super easy" to isolate? Humans are terrible at this to begin with,
Humans are amazing at it. You can discern the different instruments way better than any stem separating AI.