Comment by yjftsjthsd-h
2 days ago
> Visual prompting: Click on the person or object in the video that’s making a sound to isolate their audio.
How does that work? Correlating sound with movement?
If it’s anything like the original SAM, thousands of hours of annotator time.
If I had to do it synthetically, take single subjects with a single sound and combine them together. Then train a model to separate them again.
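A minimal sketch of the data side of that synthetic approach, assuming sine tones as stand-ins for clean single-source recordings. The names `tone` and `make_training_pair` are hypothetical and for illustration only, and nothing here reflects how the actual model was trained. The point is that summing known sources gives a mixture with the separation targets already labeled:

```python
import math
import random

SAMPLE_RATE = 8000  # Hz; arbitrary choice for this toy example


def tone(freq_hz, num_samples=80, amplitude=0.5):
    """Stand-in for a clean recording of one subject making one sound."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
        for t in range(num_samples)
    ]


def make_training_pair(sources, rng):
    """Pick two distinct sources, sum them, and keep the originals as targets."""
    a, b = rng.sample(sources, 2)
    mixture = [x + y for x, y in zip(a, b)]
    return mixture, (a, b)


rng = random.Random(0)
sources = [tone(f) for f in (220.0, 440.0, 880.0)]

# The model input would be `mixture`; the training targets are the two
# original sources, so no human annotation is needed for this pair.
mixture, (target_a, target_b) = make_training_pair(sources, rng)
```

A separation model would then be trained to recover `target_a` and `target_b` from `mixture`; that training step is omitted here.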
Think about it conceptually:
Could you watch a music video and say "that's the snare drum, that's the lead singer, keyboard, bass, that's the truck that's making the engine noise, that's the crowd that's cheering, oh and that's a jackhammer in the background"? So can AI.
Could you point out who is lead guitar and who is rhythm guitar? So can AI.
I thought about it. Still seems kind of pointless.
That doesn't seem any better than typing "rhythm guitar". In fact, it seems worse and with extra steps. Sometimes the thing making the sound is not pictured. This thing is going to make me scrub through the video until the bass player is in frame instead of just typing "bass guitar". Then it will burn some power inferring that the thing I clicked on was a bass.
To be fair, it's one of 3 ways to prompt
I mean, sometimes I'm -mixing- a show and I couldn't tell you where a specific sound is coming from...
> sometimes I'm -mixing- a show and I couldn't tell you where a specific sound is coming from
And in those situations it won't work. Is any of this really a surprise?