ResNet (short for Residual Network) is a deep neural network architecture that achieved major advances in image recognition. It was introduced by Kaiming He et al. in 2015.
The key innovation of ResNet is the use of residual connections, or skip connections, that enable the network to learn residual mappings instead of directly learning the desired underlying mappings. This addresses the problem of vanishing gradients that commonly occurs in very deep neural networks.
In a ResNet, the input data flows through a series of residual blocks. Each residual block consists of a few convolutional layers, each followed by batch normalization and a rectified linear unit (ReLU) activation. The input to a residual block is carried around the block by a shortcut connection and added to the block's output, so the stacked layers only have to learn the residual F(x) = H(x) - x between the desired mapping H(x) and the input x.
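For concreteness, here is a minimal sketch of such a residual block in PyTorch; the class name, channel count, and two-convolution layout are illustrative assumptions rather than the exact configuration from the paper.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: two 3x3 convs with BN, plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        f = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))  # residual branch F(x)
        return torch.relu(x + f)                                       # H(x) = F(x) + x, then ReLU

The shortcut is a plain addition, so it adds no parameters; the stacked layers only have to model the residual F(x).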
Residual connections let gradients propagate more effectively through the network, enabling the training of much deeper models. This makes it possible to build extremely deep ResNet architectures with hundreds of layers, such as ResNet-101 or ResNet-152, while still maintaining good performance.
ResNet has become a widely adopted architecture in various computer vision tasks, including image classification, object detection, and image segmentation. Its ability to train very deep networks effectively has made it a fundamental building block in the field of deep learning.
1. Learning with Purpose
DEEP RESIDUAL NETWORKS
Kaiming He et al., “Deep Residual Learning for Image Recognition”
Kaiming He et al., “Identity Mappings in Deep Residual Networks”
Andreas Veit et al., “Residual Networks Behave Like Ensembles of Relatively Shallow Networks”
2. Learning with Purpose
ResNet @ILSVRC & COCO 2015 Competitions
1st places in all five main tracks
• ImageNet Classification: “Ultra-deep” 152-layer nets
• ImageNet Detection: 16% better than 2nd
• ImageNet Localization: 27% better than 2nd
• COCO Detection: 11% better than 2nd
• COCO Segmentation: 12% better than 2nd
3. Learning with Purpose
Evolution of Deep Networks
[Chart: ImageNet Classification Challenge error rates by year]
ImageNet competition results show that the winning solutions have become deeper and deeper: from 8 layers in 2012 to 200+ layers in 2016.
6. Learning with Purpose
What Does Depth Mean?
Is learning better networks as easy as stacking more layers?
Backward (gradient flow)
7. Learning with Purpose
Gradient Vanishing
• The multiplying property of gradients causes the vanishing-gradient phenomenon (see the numeric sketch below)
• This can be addressed by:
– Normalized initialization
– Batch Normalization
– An appropriate activation function
[Plots: Sigmoid(x) vs. ReLU(x)]
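As a toy illustration of the multiplying property (numbers chosen for illustration, not taken from the slides): the chain rule multiplies one derivative factor per layer, and with a saturating activation such as the sigmoid, whose derivative never exceeds 0.25, the product shrinks exponentially with depth.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

depth = 50
grad = 1.0
for _ in range(depth):
    grad *= sigmoid(1.0) * (1.0 - sigmoid(1.0))  # sigmoid'(1.0) ≈ 0.197
print(grad)   # ≈ 4e-36: the gradient has effectively vanished

# ReLU's derivative is 1 on active units, so the corresponding product does not shrink,
# which is why an appropriate activation function helps (though it alone does not fix depth).
print(1.0 ** depth)   # 1.0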
8. Learning with Purpose
Simply Stacking Layers?
• Plain nets: stacking 3×3 conv layers…
• The 56-layer net has higher training error and test error than the 20-layer net
[Training curves: plain networks on CIFAR-10]
9. Learning with Purpose
Performance Saturation/Degradation
• Overly deep plain nets have higher training error
• A general phenomenon, observed in many datasets.
10. Learning with Purpose
[Diagram: a shallower model (18 layers) vs. a deeper counterpart (34 layers)]
• The deeper model has a richer solution space
• A deeper model should not have higher training error than its shallower counterpart
• A solution by construction (sketched in code below):
– Original layers: copied from the trained shallower model
– Extra layers: set to identity
– This gives at least the same training error
• Optimization difficulty: solvers cannot find such a solution when going deeper…
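The construction argument can be sketched directly in PyTorch; the small shallow_net below is a hypothetical stand-in for any trained shallower model, not code from the papers.

import torch
import torch.nn as nn

# Hypothetical stand-in for a trained shallower model.
shallow_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

# Deeper counterpart by construction: the copied layers plus extra identity layers.
deeper_net = nn.Sequential(shallow_net, nn.Identity(), nn.Identity())

x = torch.randn(4, 16)
assert torch.equal(shallow_net(x), deeper_net(x))  # identical outputs, so identical training error

The point is that such a solution exists by construction, yet plain solvers fail to find anything as good once the stack is deep; that failure is the degradation problem.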
11. Learning with Purpose
Network Design
• Keep it simple
• Based on the VGG philosophy:
– All 3×3 conv (almost)
– When the spatial size is halved (/2), the number of filters is doubled (×2); see the sketch below
– Simple design; just deep!
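A sketch of a downsampling block that follows this rule, in PyTorch (illustrative; the 1×1 projection shortcut is one common way to match the changed dimensions so that the addition remains possible):

import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Residual block where the spatial size is halved and the number of filters is doubled."""
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch * 2
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection shortcut so the shapes of x and F(x) match at the addition.
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        f = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(self.proj(x) + f)

print(DownsampleBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])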
13. Learning with Purpose
Residual Learning Block
• Define H(x) as the desired underlying mapping and write H(x) = F(x) + x; the stacked weight layers then only have to approximate the residual F(x) = H(x) − x instead of H(x) itself.
• If the optimal function is close to an identity mapping, it is easier for the solver to find small perturbations with reference to an identity mapping than to learn the function as a new one.
• The identity shortcut introduces neither extra parameters nor extra computational complexity.
• The element-wise addition is performed over all feature maps.
14. Learning with Purpose
The Insight of Identity Mapping
• Write the block as y_l = x_l + F(x_l, W_l) and x_{l+1} = f(y_l), where f is the ReLU applied after the addition.
• We turn this after-addition ReLU f into an identity mapping.
• If f is an identity mapping: x_{l+1} ≡ y_l, i.e. x_{l+1} = x_l + F(x_l, W_l).
15. Learning with Purpose
Smooth Forward Propagation
• Any x_l is directly forward-propagated to any deeper x_L, plus a residual term:
x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)
• Any x_L is thus an additive outcome of the shallower features.
• In contrast, a plain network (ignoring BN and ReLU) is multiplicative:
x_L = (Π_{i=l}^{L-1} W_i) x_l
16. Learning with Purpose
Smooth Backward Propagation
• The gradient flow is also in the form of an addition:
∂E/∂x_l = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, W_i))
• The gradient of any layer is unlikely to vanish, because the additive term can hardly be exactly −1 for all samples in a mini-batch (a toy autograd check follows below).
• In contrast, in a plain network the gradient is a product of per-layer factors, which can vanish or explode.
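A toy check of the additive gradient with PyTorch autograd; the quadratic residual branch below is an arbitrary stand-in for F, chosen only so the derivative is easy to verify by hand.

import torch

x = torch.tensor(2.0, requires_grad=True)

def residual_branch(x):        # stand-in for F(x)
    return 0.5 * x ** 2        # so F'(x) = x

y = x + residual_branch(x)     # shortcut: y = x + F(x)
y.backward()
print(x.grad)                  # dy/dx = 1 + F'(x) = 1 + 2 = 3.0

# The constant 1 contributed by the shortcut gives the gradient a direct path,
# no matter how small F'(x) becomes.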
18. Learning with Purpose
If Scaling the Shortcut
• Replace the identity shortcut with a scaled one, x_{l+1} = λ_l x_l + F(x_l, W_l); the signal from x_l to x_L then carries a factor Π_{i=l}^{L-1} λ_i.
• For an extremely deep network (L is large), if λ_i > 1 for all i, this factor can be exponentially large;
• If λ_i < 1 for all i, this factor can be exponentially small, and the signal (and its gradient) vanishes (a numeric illustration follows below).
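A quick numeric illustration (the scale factors are arbitrary, not values from the paper) of why any constant scaling other than 1 becomes harmful at large depth:

depth = 100
for scale in (0.9, 1.0, 1.1):
    factor = scale ** depth    # product of per-block shortcut scalings over `depth` blocks
    print(scale, factor)
# 0.9 -> ~2.7e-05  (the signal vanishes)
# 1.0 -> 1.0       (identity shortcut: the signal is preserved)
# 1.1 -> ~1.4e+04  (the signal explodes)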
19. Learning with Purpose
If Gating the Shortcut
• Gating the shortcut adds parameters, so it should, if anything, increase the representational ability.
• Yet it is the optimization, rather than the representational ability, that dominates the results.
21. Learning with Purpose
Training curves on CIFAR-10 of various shortcuts
Solid lines denote test error (y-axis on the right), and dashed lines denote training loss (y-axis on the left)
24. Learning with Purpose
ReLU vs. ReLU + BN
• BN could block propagation along the shortcut path
• Keep the shortest path as smooth as possible
25. Learning with Purpose
ReLU vs. Identity
• The after-addition ReLU could block propagation when the network is deep
• Pre-activation (moving BN and ReLU in front of the weight layers) eases the difficulty in optimization (a sketch of such a block follows below)
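A sketch of a pre-activation residual block in PyTorch (layer sizes illustrative): BN and ReLU move in front of each convolution, and nothing follows the addition, so the shortcut path stays clean.

import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: (BN -> ReLU -> conv) twice, plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out   # no BN or ReLU after the addition: the shortcut remains an identity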
27. Learning with Purpose
Conclusion From He
Keep the shortest path as smooth (clean) as possible, by making both h(x) and f(x) identity mappings
Forward and backward signals flow directly along this path
The features of any layer are an additive outcome of the shallower layers
A 1000-layer ResNet can be easily trained and achieves better accuracy
28. Learning with Purpose
Further Expansion of the Residual Network
[Diagram: one residual module mapping y_l to y_{l+1} through a residual branch f(·) plus a skip connection]
Following the previous analysis, we replace x_l with y_l and F with f_l, so each residual module computes
y_l = y_{l-1} + f_l(y_{l-1})
We further expand this expression by unrolling the recursion in terms of the basic input y_0.
This yields a novel interpretation of residual networks.
29. Learning with Purpose
Example of Unrolling
Take L = 3 and l = 0 as an example of unrolling:
y_3 = y_2 + f_3(y_2)
    = [y_1 + f_2(y_1)] + f_3(y_1 + f_2(y_1))
    = [y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0))] + f_3(y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0)))
The data flows along exponentially many paths from input to output; with n residual modules, each module can be entered or skipped, so we infer that residual networks have 2^n paths (a short enumeration sketch follows below).
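A short sketch that makes the path counting concrete (pure Python, purely illustrative): each of the n residual modules is either entered or skipped, so every binary choice vector corresponds to one path.

from itertools import product

n = 3  # number of residual modules, as in the L = 3 example above

# Each path is one decision per module: 1 = go through f_i, 0 = take the skip connection.
paths = list(product((0, 1), repeat=n))
print(len(paths))   # 8 == 2**3
for p in paths:
    entered = [f"f_{i+1}" for i, d in enumerate(p) if d]
    print(p, "->", " then ".join(entered) if entered else "pure identity path")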
30. Learning with Purpose
Different from a Traditional Neural Network
In a traditional feed-forward network, each layer only depends on the output of the previous layer.
In a ResNet, data flows along many paths from input to output. Each path is a unique configuration of which residual modules to enter and which to skip.
31. Learning with Purpose
Deleting an Individual Module in ResNet
Deleting a single residual module at test time (a) is equivalent to zeroing half of the paths.
In ordinary feed-forward networks (b), such as VGG or AlexNet, deleting an individual layer removes the only viable path from input to output (a toy sketch follows below).
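A toy sketch in PyTorch (an untrained stack of linear residual blocks, purely illustrative) of why a residual module can be deleted at test time: the skip connection keeps the other paths open, whereas removing a layer from a plain sequential stack removes the only path.

import torch
import torch.nn as nn

torch.manual_seed(0)
# Four residual branches over 8-dimensional features (toy setup).
branches = nn.ModuleList(
    nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8)) for _ in range(4)
)

def forward(x, drop=None):
    # drop: index of the residual branch deleted at test time; the shortcut is always kept.
    for i, f in enumerate(branches):
        x = x + (0 if i == drop else f(x))
    return x

x = torch.randn(1, 8)
print(forward(x))           # all 2**4 = 16 paths contribute
print(forward(x, drop=2))   # the 8 paths through branch 2 are zeroed; the output is perturbed,
                            # not destroyed, because the remaining 8 paths still reach it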
33. Learning with Purpose
Deleting Many Modules in ResNet
One key characteristic of ensembles is that their performance varies smoothly with the number of members.
When k residual modules are removed, the effective number of paths is reduced from 2^n to 2^(n-k).
Error increases smoothly when several modules are randomly deleted from a residual network.
34. Learning with Purpose
Reordering Modules in ResNet
Error also increases smoothly when a residual network is re-ordered by shuffling its building blocks. The degree of reordering is measured by the Kendall tau correlation coefficient.
35. Learning with Purpose
Conclusion
First, the unraveled view reveals that a residual network can be viewed as a collection of many paths, instead of a single ultra-deep network.
Second, lesion studies show that, although these paths are trained jointly, they do not strongly depend on each other.
Hi, today I am going to introduce deep residual networks. This presentation covers three papers: the first two are from Kaiming He and his team, and the third offers a novel interpretation of residual networks. I know all of you are familiar with ResNet, so if there is anything I have misunderstood, or anything you think I should know, please don't hesitate to tell me.
Although you may already know about ResNet's contributions and competition results, I still want to share them with you. ResNet won a lot of competitions, taking 1st place in all five main tracks: ImageNet classification, detection, and localization, and COCO detection and segmentation.
From the picture of the evolution of deep networks, we can see that the winning solutions have become deeper and deeper, from 8 layers in 2012 to 200+ layers in 2016. ResNet brought a big improvement in performance.
It is noteworthy that the gating and 1×1 convolutional shortcuts introduce more parameters, and should have stronger representational abilities than identity shortcuts.
However, their training error is higher than that of identity shortcuts, indicating that the degradation of these models is caused by optimization difficulties rather than by a lack of representational ability.