Comment by siavosh
6 days ago
Does someone know how the "semantic" embeddings are learned? That seems like perhaps the main technical challenge here.
From the paper, section 2.1: minimize_θ,φ,Δ ||P_φ(Δ, E_θ(x)) - sg(E_θ'(y))||_1
where
y - full video, x - masked video, E_θ(.) - learned encoder (the semantic embedding), P_φ(.) - learned predictor, Δ - learned mask (which patches in the video were dropped), sg(.) - stop gradient, which blocks gradient propagation into E_θ'(.). E_θ'(.) is in turn an exponential moving average of E_θ(.), i.e. θ'_new <- τ θ'_old + (1-τ) θ. So the loss is applied only to the predictions of the masked patches, while the target encoder of the full video slowly follows the learned one. This asymmetry in learning prevents the encoder from collapsing to a trivial constant.
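As a rough illustration of the moving parts (names and the single-linear-map "encoder" are my own toy simplifications, not the paper's code), here is the masked-patch L1 loss plus the EMA target update in numpy. Since numpy has no autograd, the stop-gradient is implicit: the target branch simply never gets updated by the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                           # patch embedding dim (toy choice)
theta = rng.normal(size=(D, D)) # online encoder E_theta (a single linear map here)
theta_t = theta.copy()          # EMA target encoder E_theta'
phi = rng.normal(size=(D, D))   # predictor P_phi
tau = 0.99                      # EMA decay

y = rng.normal(size=(16, D))            # "full video": 16 patch tokens
mask = np.arange(16) % 2 == 0           # Delta: which patches are dropped (fixed here)
x = y[~mask]                            # "masked video" = visible patches only

def encode(w, v):
    # Toy encoder: one linear map per patch token.
    return v @ w

# Predict embeddings of the masked patches from the visible context.
# (Here the predictor just maps the mean context embedding - a big simplification.)
context = encode(theta, x).mean(axis=0)
pred = np.tile(context @ phi, (int(mask.sum()), 1))

# sg(.): targets come from theta_t and receive no gradient updates.
target = encode(theta_t, y[mask])

# L1 loss on the masked patches only.
loss = np.abs(pred - target).mean()

# EMA update: theta'_new <- tau * theta'_old + (1 - tau) * theta
theta_t = tau * theta_t + (1 - tau) * theta
```

In a real training loop, a gradient step on `loss` would update only `theta` and `phi` (and the mask parameters), after which the EMA line drags `theta_t` toward `theta`; that one-way flow is the asymmetry described above.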