Comment by porridgeraisin

2 months ago

Ah, reinforcement learning.

Edit: explaining it in text

they have run the equivalent of

  expected_output = torch.tril(torch.matmul(A,B))

  ai_output = ai()

In the nested torch expression above for the expected output, there is an intermediate value (the matmul). The backing memory for this is returned to torch after the entire expression is computed.

The model's code which runs directly afer this then requested memory of the same shape (torch.like()). Torch has dutifully returned the block it just reclaimed (containing the expected output) without zeroing it. And so the model has the answer.

Pretty crazy that this was the code it converged on regardless of the fact that it invalidates the original claim.