Understanding the receptive field of deep convolutional networks

In this article, we will discuss multiple perspectives on the receptive field of a deep convolutional architecture. We will address the influence of the receptive field starting from the human visual system. As you will see, a lot of deep learning terminology comes from neuroscience. As a short motivation, convolutions are awesome, but it is not enough just to understand how they work. The idea of the receptive field will help you dive into the architecture that you are using or developing. If you are looking for an in-depth analysis of how to calculate the receptive field of your model, as well as the most effective ways to increase it, this article was made for you. In the end, fundamentals are to be mastered! Let’s begin.

According to Wikipedia [1], the receptive field (of a biological neuron) is “the portion of the sensory space that can elicit neuronal responses, when stimulated”. The sensory space can be defined in any dimension (e.g. a 2D perceived image for an eye). Simply put, the neuronal response can be defined as the firing rate (i.e. the number of action potentials generated by a neuron per unit of time), so it depends on how the stimuli evolve over time. What is important is that it affects the frames per second (FPS) that our visual system perceives. The exact FPS of our visual system is not clear, and it certainly changes across situations (e.g. when we are in danger). Wikipedia [2] says:

Insight: The human visual system can process 10 to 12 images per second and perceive them individually, while higher rates are perceived as motion.

Let’s observe this image to further clarify these concepts:

The visual human system. Source: brainconnection

Based on the image, the entire area (the grid in the figure) an eye can see is called the field of view. The human visual system consists of millions of neurons, where each one captures different information. We define a neuron’s receptive field as the patch of the total field of view that it responds to. In other words, it is the information a single neuron has access to. This is, in simple terms, the biological cell’s receptive field.

Let’s see how we can extend this idea in convolutional networks.


What is the receptive field in deep learning?

Similarly, in a deep learning context, the Receptive Field (RF) is defined as the size of the region in the input that produces the feature [3]. Basically, it is a measure of association of an output feature (of any layer) to the input region (patch). Before we move on, let’s clarify one important thing:

Insight: The idea of receptive fields applies to local operations (e.g. convolution, pooling).

Source: Research Gate

A convolutional unit only depends on a local region (patch) of the input. That’s why we never refer to the RF of fully connected layers, since each unit has access to the entire input. To this end, our aim is to provide you with an insight into this concept, in order to understand and analyze how deep convolutional networks work with local operations.

Ok, but why should anyone care about the RF?

Why do we care about the receptive field of a convolutional network?

There is no better way to clarify this than a couple of computer vision examples. In particular, let’s revisit a couple of dense prediction tasks. In image segmentation and optical flow estimation, we produce a prediction for each pixel of the input image, which corresponds to a new image (the semantic label map in the case of segmentation). Ideally, we would like each output pixel to have a big receptive field, so as to ensure that no crucial information is left out. For instance, if we want to predict the boundaries of an object (e.g. a car, an organ like the heart, a tumor), it is important that we provide the model access to all the relevant parts of the input object that we want to segment. In the image below, you can see two receptive fields: the green and the orange one. Which one would you like to have in your architecture?

The green and orange rectangles are two different receptive fields. Which one would you prefer? Source: Nvidia's blog

Similarly, in object detection, a model with a small receptive field may not be able to recognize large objects. That’s why you usually see multi-scale approaches in object detection. Furthermore, in motion-based tasks, like video prediction and optical flow estimation, we want to capture large motions (displacements of pixels in a 2D grid), so we need an adequate receptive field. Specifically, the receptive field is considered sufficient if it is larger than the largest flow magnitude in the dataset.

Therefore, our goal is to design a convolutional model so that we ensure that its RF covers the entire relevant input image region.

To convince you even more, in the diagram below you can see the relation between the RF and the classification accuracy on ImageNet. The radius of each circle reflects the number of floating-point operations (FLOPs) of the model. Purple corresponds to the ResNet [4] family (50, 101, and 152 layers), while yellow is the Inception [5] family (v2, v3, v4). Light blue is the MobileNet architecture.

Image is borrowed from Araujo et al. [3]

As perfectly described by Araujo et al. [3]:

“We observe a logarithmic relationship between classification accuracy and receptive field size, which suggests that large receptive fields are necessary for high-level recognition tasks, but with diminishing rewards.”

Of course, the receptive field size is not the only factor contributing to improved recognition performance. Still, the point is that you should definitely be aware of your model’s receptive field.

Ok, so how can we measure it?

Closed-form calculations of the receptive field for single-path networks

In the amazing work by Araujo et al. [3], the authors provide an intuitive way to calculate the RF of your model in analytical form. A single path literally means no skip connections in the architecture, like the famous AlexNet. Let’s see some math! For two sequential convolutional layers $f_2, f_1$ with kernel size $k$, stride $s$, and receptive field $r$:

r_1 = s_2 \times r_2 + (k_2-s_2)

Or in a more general form:

r_{(i-1)} = s_{i} \times r_{i} + (k_{i}-s_{i})

The image below may help you clarify this equation. Note that we are interested to see the influence of the receptive field starting from the last layer towards the input. So, in that sense, we go backwards.

1D sequential conv. Layers visualization taken from Araujo et al. [3]

It seems like this equation can be generalized in a beautiful compact equation that simply applies this operation recursively for L layers. By further analyzing the recursive equation, we can derive a closed form solution that depends only on the convolutional parameters of kernels and strides [3]:

r_0 = \sum_{i=1}^{L} \left( (k_{i} - 1) \prod_{j=1}^{i-1} s_{j} \right) + 1 \quad \quad (eq.1)

where $r_0$ denotes the desired RF of the architecture.
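To make the two formulas concrete, here is a minimal Python sketch that evaluates both the backward recursion and the closed form of eq. 1 on the same stack of layers. The function names and the example stack of (kernel size, stride) pairs are hypothetical and chosen only for illustration; the layers are listed from input to output.

```python
from math import prod


def rf_recursive(layers):
    """Backward recursion r_{i-1} = s_i * r_i + (k_i - s_i), starting from r_L = 1."""
    r = 1  # one feature of the last layer
    for k, s in reversed(layers):
        r = s * r + (k - s)
    return r


def rf_closed_form(layers):
    """Closed form: sum over layers of (k_i - 1) * (product of earlier strides), plus 1."""
    total = 0
    for i, (k, _) in enumerate(layers):
        total += (k - 1) * prod(s for _, s in layers[:i])
    return total + 1


# Hypothetical single-path network: (kernel_size, stride), input to output.
layers = [(3, 1), (3, 2), (3, 1), (2, 2)]
print(rf_recursive(layers), rf_closed_form(layers))  # both print 11
```

Both functions agree, which is a quick sanity check that the closed form is just the unrolled recursion.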

Ok, I measured the theoretical RF of my model. Now, how can I increase it?

How can we increase the receptive field in a convolutional network?

In essence, there are a plethora of ways and tricks to increase the RF, that can be summarized as follows:

Add more convolutional layers (make the network deeper)
Add pooling layers or higher stride convolutions (sub-sampling)
Use dilated convolutions
Depth-wise convolutions

Let’s look at the distinct characteristics of these approaches.

Add more convolutional layers

Option 1 increases the receptive field size linearly, as each extra layer increases the receptive field size by the kernel size [7] (more precisely, by $k-1$ for a stride-1 layer, as eq. 1 shows). Moreover, it is experimentally validated that as the theoretical receptive field increases with depth, the effective (experimental) receptive field shrinks relative to it. In the figure below, RF refers to the theoretical receptive field, while ERF corresponds to the effective receptive field.

Increasing the number of layers decreases the ERF ratio, taken from Luo et al. [7]

Sub-sampling and dilated convolutions

Sub-sampling techniques like pooling (option 2), on the other hand, increase the receptive field size multiplicatively. Modern architectures like ResNet combine these techniques (options 1 and 2). Sequentially placed dilated convolutions, in turn, increase the RF exponentially.
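The difference between linear and multiplicative growth is easy to see numerically. The sketch below applies the backward recursion from the single-path section to two hypothetical stacks: stride-1 3x3 convolutions only, and the same convolutions each followed by a 2x2, stride-2 pooling layer. The layer choices are assumptions made for illustration, not a specific published architecture.

```python
def receptive_field(layers):
    """Backward recursion r_{i-1} = s_i * r_i + (k_i - s_i) over (kernel, stride) pairs."""
    r = 1
    for k, s in reversed(layers):
        r = s * r + (k - s)
    return r


CONV = (3, 1)  # 3x3 convolution, stride 1
POOL = (2, 2)  # 2x2 pooling, stride 2

for n in range(1, 6):
    plain = receptive_field([CONV] * n)         # convolutions only
    pooled = receptive_field([CONV, POOL] * n)  # each convolution followed by pooling
    print(n, plain, pooled)
```

With five blocks, the convolution-only stack reaches a receptive field of 11, while the stack with pooling reaches 94, because every stride-2 layer doubles the contribution of all layers that come after it.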

But first, let’s revisit the idea of dilated convolutions.

In essence, dilated convolutions introduce another parameter, denoted as $r$ and called the dilation rate (not to be confused with the receptive field $r_i$ of the previous equations). Dilations introduce “holes” in a convolutional kernel [3]. The “holes” basically define a spacing between the values of the kernel. So, while the number of weights in the kernel is unchanged, the weights are no longer applied to spatially adjacent samples. Dilating a kernel by a factor of $r$ introduces a kind of striding of $r$ between the samples that the kernel sees.

The previous equations can be reused by simply replacing the kernel size $k$ of every dilated layer with the effective kernel size $k'$:

k' = r (k-1) + 1

Keep this equation in the back of your mind.

All the above can be illustrated in the following gif, produced by Dumoulin et al. 2016 [11]. I think the image speaks for itself:

Source: A guide to convolution arithmetic for deep learning, Dumoulin et al. 2016 [11]

Now, let’s briefly inspect how dilated convolutions can influence the receptive field.

Let’s look at 3 sequential conv layers (denoted by a, b, c), illustrated in the image below: a normal convolution, a convolution with dilation factor r=2, and one with r=4. We will intuitively understand why dilation supports an exponential expansion of the receptive field without loss of resolution (as happens with pooling) or coverage.

Image is borrowed from Yu et al. 2015 [9]

Analysis

In (a) we have a normal 3×3 convolution with a receptive field of 3×3. In (b) we have a 2-dilated 3×3 convolution that is applied to the output of layer (a), which is a normal convolution. As a result, each element in the 2 coupled layers now has a receptive field of 7×7. If we studied the 2-dilated conv alone, the receptive field would simply be 5×5 with the same number of parameters. In (c), by applying a 4-dilated convolution, each element in the third sequential conv layer now has a receptive field of 15×15. As a result, the receptive field grows exponentially while the number of parameters grows linearly [9].

In other words, a 3x3 kernel with a dilation rate of 2 will have the same receptive field as a 5x5 kernel, while only using 9 parameters. Similarly, a 3x3 kernel with a dilation rate of 4 will have the same receptive field as a 9x9 kernel without dilation. Mathematically:

k' = r (k-1) + 1
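The numbers in the analysis above can be reproduced with a few lines of Python. This is a minimal sketch assuming stride-1 layers, where each layer adds its effective kernel size minus one to the receptive field; the function names are made up for this example.

```python
def effective_kernel(k, dilation):
    """Span covered by a dilated kernel: dilation * (k - 1) + 1."""
    return dilation * (k - 1) + 1


def rf_after_each_layer(layers):
    """Receptive field after each stride-1 layer, given (kernel_size, dilation) pairs."""
    r, history = 1, []
    for k, d in layers:
        r += effective_kernel(k, d) - 1  # a stride-1 layer adds (effective kernel - 1)
        history.append(r)
    return history


stack = [(3, 1), (3, 2), (3, 4)]  # layers (a), (b), (c)
print(rf_after_each_layer(stack))                  # [3, 7, 15]
print([effective_kernel(k, d) for k, d in stack])  # [3, 5, 9]
print([k * k for k, _ in stack])                   # [9, 9, 9] weights per 2D kernel
```

The receptive field roughly doubles at every layer, while each layer still uses a single 3x3 kernel of 9 weights.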

Insight: In deep architectures, we often introduce dilated convolutions in the last convolutional layers.

Below you can observe the resulting ERF (effective receptive field) when introducing pooling operations and dilation, in an experimental study performed by Luo et al. [7]. The receptive field is bigger in both cases, and with pooling it is observed to be larger than with dilation in a practical setup. We will see more about the effective receptive field later on.

A visualization of the effective receptive field (ERF) by introducing pooling strategies and dilation, taken from Luo et al. 2016 [7]

Insight: Based on [7], pooling operations and dilated convolutions turn out to be effective ways to increase the receptive field size quickly.

Finally, as described in Araujo et al. [3], with depth-wise convolutions the receptive field is increased with a small compute footprint, so they are considered a compact way to increase the receptive field with fewer parameters. A depth-wise convolution is a spatial convolution applied to each channel separately. Note, however, that depth-wise convolutions do not directly increase the receptive field. Since they use fewer parameters and less computation, we can afford to add more layers. Thus, with roughly the same number of parameters, we can get a bigger receptive field. MobileNet [10] achieves high recognition performance based on this idea.
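A rough parameter count illustrates the argument. The sketch below compares a standard convolution with a depth-wise convolution followed by a 1x1 point-wise convolution that mixes the channels. The point-wise step is an assumption here (it is the depth-wise separable design used by MobileNet), and the channel sizes are hypothetical; biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a k x k depth-wise conv plus a 1 x 1 point-wise conv (biases ignored)."""
    return k * k * c_in + c_in * c_out


k, c_in, c_out = 3, 128, 128  # hypothetical layer sizes
standard = conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
layers_for_same_budget = standard // separable

print(standard, separable, layers_for_same_budget)  # 147456 17536 8
# Receptive field of n stacked stride-1 3x3 layers is 1 + 2 * n.
print(1 + 2 * 1, 1 + 2 * layers_for_same_budget)    # 3 17
```

Under these assumptions, the parameter budget of one standard 3x3 layer pays for about eight separable layers, which takes the theoretical receptive field from 3 to 17.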

Skip-connections and receptive field

If you want to revisit the ideas behind skip connections, feel free to check my relevant article.

In a model without any skip-connections, the receptive field is considered fixed. However, when introducing $n$ skip-residual blocks, the network utilizes $2^n$ different paths, and therefore features can be learned with a large range of different receptive fields [8]. For example, the HighResNet architecture [8] has a maximum receptive field of 87 pixels, coming from $2^9$ unique paths. In the following figure, we can observe the distribution of the receptive field of these paths in the architecture. The receptive field, in this case, ranges from 3 to 87, following a binomial distribution.
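To see where such a distribution comes from, one can enumerate every path explicitly. The sketch below assumes a configuration chosen to match the 3 to 87 range quoted above: an initial 3x3 convolution shared by all paths, followed by nine residual blocks of two 3x3 stride-1 convolutions each, with dilation 1, 2 and 4 for three blocks each. Treat these settings as assumptions for illustration rather than an exact description of [8].

```python
from collections import Counter
from itertools import product

STEM_RF = 3  # assumed initial 3x3 convolution that every path passes through
# Assumed residual blocks: two 3x3 convolutions each, with the dilation listed here.
BLOCK_DILATIONS = [1, 1, 1, 2, 2, 2, 4, 4, 4]


def block_rf_gain(dilation, k=3, convs_per_block=2):
    """Receptive field added when a path goes through a stride-1 block."""
    return convs_per_block * dilation * (k - 1)


# Every path either skips (0) or traverses (1) each block: 2^n combinations.
path_rfs = [
    STEM_RF + sum(use * block_rf_gain(d) for use, d in zip(choice, BLOCK_DILATIONS))
    for choice in product((0, 1), repeat=len(BLOCK_DILATIONS))
]

histogram = Counter(path_rfs)
print(len(path_rfs), min(path_rfs), max(path_rfs))  # 512 3 87
for rf in sorted(histogram):
    print(rf, histogram[rf])
```

The nine skip-or-traverse choices give 512 paths. The shortest one skips every block and only sees the 3-pixel stem, while the longest one traverses all blocks and reaches 87.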

The histogram of receptive field distribution of HighResNet [8]

Insight: Skip-connections may provide more paths, however, based on [7], they tend to make the effective receptive field smaller.

Receptive field and transposed convolutions, upsampling, separable convolutions, and batch normalization

Upsampling

Upsampling is also a local operation. For the purposes of RF computation, it can be considered to have a kernel size equal to the number of input features involved in the computation of an output feature. Since we usually double the spatial dimension, as shown in the figure below, the kernel is $k=1$.

Upsampling borrowed from here

Separable convolutions

In short, the RF properties of the separable convolution are identical to its corresponding equivalent non-separable convolution. So, practically nothing changes in terms of the receptive field.

Batch normalization

During training, batch normalization parameters are computed based on all the channel elements of the feature map. Thus, one can state that its receptive field is the whole input image.

Understanding the effective receptive field

In [7], Luo et al. discover that not all pixels in a receptive field contribute equally to an output unit’s response. In the previous image, we observed that the receptive field varies with skip connections.

Intuitively, it is easy to see that pixels at the center of a receptive field have a much larger impact on the output, since they have more “paths” to contribute to it.

As a natural consequence, one can define the effective receptive field (ERF) of a feature through the relative importance of each input pixel. In other words, the ERF of a central output unit is the region that contains any input pixel with a non-negligible impact on that unit.

Specifically, as it is referenced in [7] we can intuitively realize the contribution of central pixels in the forward and backward pass as:

“In the forward pass, central pixels can propagate information to the output through many different paths, while the pixels in the outer area of the receptive field have very few paths to propagate its impact. In the backward pass, gradients from an output unit are propagated across all the paths, and therefore the central pixels have a much larger magnitude for the gradient from that output.” ~ by Luo et al. 2016 [7].

A natural way to measure this impact is of course the partial derivative of the output unit with respect to each input pixel, i.e. its rate of change, as computed by backpropagation.

The effective receptive field with and without non-linearities, borrowed from Luo et al. 2016 [7]

As illustrated in the figure, without non-linearities the ERF closely resembles a textbook 2D Gaussian distribution. However, when we add non-linearities, we force the distribution to deviate from a perfect Gaussian. In simple terms, when a value is zeroed by the ReLU, no path through that unit can reach the output, hence its gradient is zero.
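The bell shape can be reproduced in a toy setting. The sketch below is a 1D simplification under strong assumptions: a purely linear stack of stride-1 convolutions with identical, uniform 3-tap kernels. In that case, backpropagating the gradient of one output unit through each layer amounts to one more convolution with the kernel. The 5% threshold used to count "effective" positions is arbitrary.

```python
def conv1d_full(signal, kernel):
    """Full 1D convolution of two lists (pure Python)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, a in enumerate(signal):
        for j, b in enumerate(kernel):
            out[i + j] += a * b
    return out


def linear_erf(num_layers, kernel):
    """Gradient of one output unit w.r.t. each input position of a linear conv stack."""
    grad = [1.0]  # gradient at the output unit itself
    for _ in range(num_layers):
        grad = conv1d_full(grad, kernel)  # backpropagate through one layer
    return grad


kernel = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform (symmetric) weights
erf = linear_erf(num_layers=10, kernel=kernel)
peak = max(erf)
# Count positions whose gradient is at least 5% of the peak (arbitrary threshold).
effective = sum(1 for g in erf if g >= 0.05 * peak)
print(len(erf), effective)
print([round(g / peak, 3) for g in erf])
```

The gradient map spans the full theoretical receptive field of 21 positions, but its values decay quickly away from the center, so the positions with a noticeable impact form a smaller, bell-shaped region.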

Based on this study the main insight is the following:

Insight: The ERF in deep convolutional networks actually grows a lot slower than we calculate in theory [7].

Last but not least, it is super important to highlight that the ERF increases after training, which reduces the gap between the theoretical RF and the ERF that was observed before training.

Conclusion

In this article, we inspected several aspects of the concept of the Receptive Field. We started from the human visual system so as to make the concepts crystal clear. We discussed the closed-form math, the effect of skip connections on the RF, and how you can increase it efficiently. Based on that, you can implement the referenced design choices in your model, while being aware of their implications.

Finally, the take-away key points of this article are summarized below:

The idea of receptive fields applies to local operations.
We want to design a model so that its receptive field covers the entire relevant input image region.
By using sequential dilated convolutions the receptive field grows exponentially, while the number of parameters grows linearly.
Pooling operations and dilated convolutions turn out to be effective ways to increase the receptive field size quickly.
Skip-connections may provide more paths, but tend to make the effective receptive field smaller.
The effective receptive field is increased after training.

As a final note, the understanding of RF in convolutional neural networks is an open research topic that will provide a lot of insights on why deep convolutional networks work so damn awesomely.

Additional material

As an additional resource on the interpretation and visualization of the RF, I would advise you to take a look at Kobayashi et al. 2020 [12]. For our more practical readers, if you want a toolkit to automatically measure the receptive field of your model in PyTorch or TensorFlow, we got your back. Finally, for those of you who are hungry for knowledge and curious about bio-inspired concepts like me, especially about the human visual system, you can watch this highly recommended starting video:

Cited as:

@article{adaloglou2020receptive,
  title   = "Understanding the receptive field of deep convolutional networks",
  author  = "Adaloglou, Nikolas",
  journal = "https://theaisummer.com/",
  year    = "2020",
  url     = "https://theaisummer.com/receptive-field/"
}

References

[1] Wikipedia: Receptive field

[2] Wikipedia: Frame rate: Human Vision

[3] Araujo, A., Norris, W., & Sim, J. (2019). Computing receptive fields of convolutional neural networks. Distill, 4(11), e21.

[4] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

[5] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).

[7] Luo, W., Li, Y., Urtasun, R., & Zemel, R. (2016). Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems (pp. 4898-4906).

[8] Li, W., Wang, G., Fidon, L., Ourselin, S., Cardoso, M. J., & Vercauteren, T. (2017, June). On the compactness, efficiency, and representation of 3D convolutional networks: brain parcellation as a pretext task. In International conference on information processing in medical imaging (pp. 348-360). Springer, Cham.

[9] Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.

[10] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., ... & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

[11] Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.

[12] Kobayashi, G., & Shouno, H. (2020). Interpretation of ResNet by Visualization of Preferred Stimulus in Receptive Fields. arXiv preprint arXiv:2006.01645.

* Disclosure: Please note that some of the links above might be affiliate links, and at no additional cost to you, we will earn a commission if you decide to make a purchase after clicking through.