Comment by in-silico

1 month ago

Why can't you just fix H_res to the identity matrix (or drop it entirely)? In that case the model is basically a ResNet again, and you don't need to worry about exploding or vanishing gradients from H_res.

I would think that H_post and H_pre could cover the lost expressiveness.
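
To make the question concrete, here is a toy NumPy sketch. It assumes the pure residual path through the network is just the product of the per-layer H_res matrices, and it ignores H_pre, H_post, and the layer function entirely. The helper `residual_path_gain`, the stream count, the depth, and the 0.3 noise scale are all made up for illustration and are not from any real implementation.

```python
import numpy as np

def residual_path_gain(h_res_layers):
    """Spectral norm of the product of per-layer residual mixing matrices.

    Hypothetical helper: treats the skip path as H_res^(L) @ ... @ H_res^(1).
    """
    n = h_res_layers[0].shape[0]
    prod = np.eye(n)
    for h in h_res_layers:
        prod = h @ prod  # compose this layer's mixing onto the path so far
    return np.linalg.norm(prod, ord=2)

rng = np.random.default_rng(0)
n_streams, depth = 4, 64  # illustrative sizes only

# Case 1: H_res fixed to the identity at every layer (plain ResNet-style skip).
identity_layers = [np.eye(n_streams) for _ in range(depth)]

# Case 2: unconstrained H_res, modeled here as identity plus random noise.
learned_layers = [
    np.eye(n_streams) + 0.3 * rng.standard_normal((n_streams, n_streams))
    for _ in range(depth)
]

gain_identity = residual_path_gain(identity_layers)
gain_learned = residual_path_gain(learned_layers)

print(f"identity H_res gain:      {gain_identity:.3e}")
print(f"unconstrained H_res gain: {gain_learned:.3e}")
```

With identity H_res the gain of the skip path stays at exactly 1 no matter how deep the stack is, which is the ResNet behavior I mean. With unconstrained matrices the gain of the product drifts away from 1 as depth grows, and that drift is the exploding/vanishing issue I'm asking about.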