Comment by gglon

5 days ago

From the paper, section 2.1: minimize_θ,φ,Δ ||P_φ(Δ, E_θ(x)) - sg(E_θ'(y))||_1

where

y - full video, x - masked video, E_θ(.) - learned encoder (semantic embedding), P_φ(.) - learned predictor, Δ - learned mask (which patches in the video were dropped), sg(.) - stop gradient, which blocks gradient propagation into E_θ'(.). E_θ'(.) is in turn an exponential moving average of E_θ(.), i.e. θ'_new <- τ θ'_old + (1-τ) θ. So the loss is applied only to the predictions of the masked patches, while the encoder of the full video follows the learned one. This asymmetry in learning prevents the encoder from collapsing to a trivial constant.
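
To make the objective concrete, here is a minimal PyTorch-style sketch of one training step under the interpretation above. The module names, shapes, and the τ value are assumptions for illustration, not the paper's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_encoder: nn.Module, encoder: nn.Module, tau: float = 0.999):
    # theta'_new <- tau * theta'_old + (1 - tau) * theta
    for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(tau).add_(p, alpha=1.0 - tau)

def jepa_loss(encoder, target_encoder, predictor, mask_tokens, x_masked, y_full):
    """Sketch of the loss above; encoder/predictor modules and mask_tokens
    (the learnable Δ marking dropped positions) are hypothetical arguments."""
    # E_theta(x): embed only the visible patches of the masked video
    z_context = encoder(x_masked)

    # sg(E_theta'(y)): targets from the EMA encoder, no gradient flows back
    with torch.no_grad():
        z_target = target_encoder(y_full)

    # P_phi(Delta, E_theta(x)): predict embeddings at the dropped positions
    z_pred = predictor(mask_tokens, z_context)

    # L1 loss; in practice the targets would be gathered only at the masked positions
    return F.l1_loss(z_pred, z_target)
```

After backpropagating this loss and stepping the optimizer on θ and φ, ema_update would be called so that the target encoder E_θ' slowly tracks E_θ. Since no gradient reaches E_θ', the targets can't be dragged toward a trivial constant, which is the asymmetry mentioned above.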