Comment by simsla
2 months ago
Avoids vanishing gradients in deeper networks.
Also, most blocks with a residual approximate the identity function when initialised, so tend to be well behaved.
2 months ago
Avoids vanishing gradients in deeper networks.
Also, most blocks with a residual approximate the identity function when initialised, so tend to be well behaved.
No comments yet
Contribute on Hacker News ↗