Comment by almaight

6 months ago

"multi-modal feature extraction → semantic translation → cross-modal feature transfer → precise temporal alignment," is all we need

0 comments