Comment by yorwba

1 day ago

> We also found interestingly that:

  torch.exp(q - q.detach()) * advantages.unsqueeze(1)

> is used, which should be evaluated to 1 right? We actually found this is necessary - it seems that the autograd engine might not be propagating gradients correctly.

The autograd engine is propagating gradients correctly, but the question is, which gradients?

You could encapsulate this as a function

  f = lambda a, b: torch.exp(a - b) * advantages.unsqueeze(1)

then let f_a(a, b) be the partial derivative of f with respect to a, and substitute q for both variables to get f_a(q, q).
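
Concretely, you can compute f_a(q, q) by using two separate tensors and only letting gradients flow through the first one (a minimal sketch; the shapes and values of q and advantages are made up here):

  import torch
  # stand-in tensors: 2 sequences, 4 tokens each
  advantages = torch.tensor([2.0, 3.0])
  q = torch.randn(2, 4)
  f = lambda a, b: torch.exp(a - b) * advantages.unsqueeze(1)
  a = q.clone().requires_grad_(True)  # differentiate with respect to a only
  b = q.clone()                       # plain tensor, treated as a constant
  f(a, b).sum().backward()
  print(a.grad)  # f_a(q, q): every entry in row i equals advantages[i]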

But if you substitute first to get f(q, q) and then differentiate with respect to q, you don't get f_a(q, q), but f_a(q, q) + f_b(q, q), which in this case is 0: the two partial derivatives are equal and opposite, so they cancel. Variable substitution and differentiation cannot be freely reordered.
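
You can check that cancellation directly (again a sketch with stand-in tensors):

  import torch
  advantages = torch.tensor([2.0, 3.0])
  q = torch.randn(2, 4, requires_grad=True)
  # f(q, q): the same differentiable q feeds both arguments
  (torch.exp(q - q) * advantages.unsqueeze(1)).sum().backward()
  print(q.grad)  # all zeros: f_a(q, q) and f_b(q, q) cancel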

detach() is a way to say "we want to differentiate the expression first, treating this argument as a constant, and only afterwards substitute the variable's value."
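
That's why the detached version recovers f_a(q, q): the forward value is just 1 times the advantages, but the backward pass matches the two-variable computation above (same stand-in tensors):

  import torch
  advantages = torch.tensor([2.0, 3.0])
  q = torch.randn(2, 4, requires_grad=True)
  # detach() freezes the second argument, so backward computes f_a(q, q)
  (torch.exp(q - q.detach()) * advantages.unsqueeze(1)).sum().backward()
  print(q.grad)  # every entry in row i equals advantages[i], not zero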