Comment by yjftsjthsd-h

2 months ago

> Visual prompting: Click on the person or object in the video that’s making a sound to isolate their audio.

How does that work? Correlating sound with movement?

6 comments

yjftsjthsd-h

If it’s anything like the original SAM, thousands of hours of annotator time.

If I had to do it synthetically, take single subjects with a single sound and combine them together. Then train a model to separate them again.
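A rough sketch of what that synthetic "mix, then un-mix" data could look like. This only illustrates the idea in the comment above; it is not how the actual model was trained. `make_mixture` and the two toy signals are made up for the example, and the visual side (which object was clicked) is left out entirely.

```python
import math
import random

def make_mixture(sources, rng, gain_range=(0.5, 1.0)):
    """Combine single-source clips into one training example.

    Returns (mixture, targets): the summed waveform a separator would
    receive as input, and the scaled per-source waveforms it should recover.
    """
    length = min(len(s) for s in sources)  # trim every clip to the shortest
    targets = []
    for s in sources:
        gain = rng.uniform(*gain_range)    # random level for each source
        targets.append([gain * x for x in s[:length]])
    # The mixture is the sample-by-sample sum of the scaled sources.
    mixture = [sum(samples) for samples in zip(*targets)]
    return mixture, targets

rng = random.Random(0)
sample_rate = 8000
# Toy stand-ins for "single subject, single sound" clips (assumed data).
voice = [math.sin(2 * math.pi * 220.0 * n / sample_rate) for n in range(sample_rate)]
drum = [1.0 if (n // 1000) % 2 == 0 else -1.0 for n in range(sample_rate)]

mixture, targets = make_mixture([voice, drum], rng)
```

A separation model would then be trained to map `mixture` back to each entry of `targets`; since the mixture was built from known parts, the ground truth comes for free.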

yodon 2 months ago

Think about it conceptually:

Could you watch a music video and say "that's the snare drum, that's the lead singer, keyboard, bass, that's the truck that's making the engine noise, that's the crowd that's cheering, oh and that's a jackhammer in the background"? So can AI.

Could you point out who is lead guitar and who is rhythm guitar? So can AI.

recursive 2 months ago
I thought about it. Still seems kind of pointless.
That doesn't seem any better than typing "rhythm guitar". In fact, it seems worse and with extra steps. Sometimes the thing making the sound is not pictured. This thing is going to make me scrub through the video until the bass player is in frame instead of just typing "bass guitar". Then it will burn some power inferring that the thing I clicked on was a bass.
- yjftsjthsd-h 2 months ago
  
  To be fair, it's one of 3 ways to prompt
scarecrowbob 2 months ago
I mean, sometimes I'm -mixing- a show and I couldn't tell you where a specific sound is coming from....
- yodon 2 months ago
  
  > sometimes I'm -mixing- a show and I couldn't tell you where a specific sound is coming from
  And in those situations it won't work. Is any of this really a surprise?