ResNet, torchvision, bottlenecks, and layers not as they seem.
I was strolling through the torchvision ResNet codebase when I stumbled upon this interesting code snippet:
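The snippet was the _make_layer method. Here it is, lightly simplified from the torchvision source (recent versions also thread through arguments like groups, dilation, and a configurable norm layer, which I'm leaving out):

```python
def _make_layer(self, block, planes, blocks, stride=1):
    downsample = None
    # If the first block changes the spatial size (stride != 1) or the channel
    # count (inplanes != planes * expansion), build a "downsample" module for
    # the skip connection. conv1x1 is a helper defined in the same file: a
    # 1x1 nn.Conv2d with bias=False.
    if stride != 1 or self.inplanes != planes * block.expansion:
        downsample = nn.Sequential(
            conv1x1(self.inplanes, planes * block.expansion, stride),
            nn.BatchNorm2d(planes * block.expansion),
        )

    layers = []
    layers.append(block(self.inplanes, planes, stride, downsample))
    self.inplanes = planes * block.expansion
    for _ in range(1, blocks):
        layers.append(block(self.inplanes, planes))

    return nn.Sequential(*layers)
```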
It is a simple enough piece of code, and it lives in the ResNet class. Its job is to insert a number of layers into the resnet based on the block type (basic residual block vs. bottleneck block), planes (the number of channels inside the block), and stride. Sometimes we also attach a downsample layer to a block, if we are downsampling or if our inplanes are not the same as planes times an expansion factor. Probably none of that was clear to you. Certainly none of it was clear to me, so I had to do some more digging. I found that the behavior works correctly, but not much of it is intuitive, so I'll try to explain. Here is some context for the _make_layer function:
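This (again paraphrased) is how ResNet.__init__ calls _make_layer, and how the named constructors pick a block type and per-chunk layer counts:

```python
# Inside ResNet.__init__, after the stem (7x7 stride-2 conv, BN, ReLU, 3x3 maxpool):
self.inplanes = 64
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

# The named constructors only choose the block type and the layer counts:
#   resnet18  -> ResNet(BasicBlock, [2, 2, 2, 2])
#   resnet34  -> ResNet(BasicBlock, [3, 4, 6, 3])
#   resnet50  -> ResNet(Bottleneck, [3, 4, 6, 3])
#   resnet101 -> ResNet(Bottleneck, [3, 4, 23, 3])
#   resnet152 -> ResNet(Bottleneck, [3, 8, 36, 3])
```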
ResNet will call _make_layer, and its behavior will be different depending on which resnet architecture you want. These include resnet18, 34, 50, 101, and 152, all of which are described by two things: the type of block they use, and how many layers each _make_layer call winds up making. (Yes, each _make_layer call makes many more than one layer, so keep that in mind.)
For resnet18 and 34, this is pretty easy. Our block type can be a BasicBlock. This is just your standard residual block as described in the original ResNet paper.
But as we add many more layers to the network for resnet50 and beyond, we can't afford to waste so much GPU RAM on those expensive 3x3 convolutions, so we use Bottleneck blocks. A Bottleneck block is very similar to a BasicBlock. All it does is use a 1x1 convolution to reduce the channels of the input before performing the expensive 3x3 convolution, then use another 1x1 to project back to the original number of channels.
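Here is a minimal sketch of that idea, assuming nothing beyond the 1x1-3x3-1x1 pattern (the real torchvision Bottleneck also attaches a batch norm to every conv and handles the stride and downsample arguments):

```python
import torch.nn as nn

class BottleneckSketch(nn.Module):
    expansion = 4  # the block's output always has planes * 4 channels

    def __init__(self, inplanes, planes):
        super().__init__()
        self.reduce = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.conv3x3 = nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False)
        self.expand = nn.Conv2d(planes, planes * self.expansion, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))     # 1x1: squeeze the channels down
        out = self.relu(self.conv3x3(out))  # the expensive 3x3, now on fewer channels
        out = self.expand(out)              # 1x1: project back out to planes * 4
        return self.relu(out + x)           # residual add; shapes must match for this to work
```

Notice that the residual add at the end only works when inplanes happens to equal planes * 4, which is exactly the wrinkle the rest of this post is about.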
Okay, so most of the time we want to project back to the channels of the input tensor. But sometimes we want to downsample our input by a factor of two. If you are familiar with VGG, then you understand this can be done with MaxPooling. ResNet does use one MaxPool layer in its initial base, but all the rest are done through stride-2 convolutions instead. So we will have to keep track of that stride, and make sure our layers aren't ever left with an input and an output that can't be summed together. So far _make_layer is looking pretty straightforward. Let's review:
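Here is the heart of it once more, with the two things to notice called out in comments. (Same paraphrased code as above.)

```python
downsample = None
if stride != 1 or self.inplanes != planes * block.expansion:
    # the skip path needs an adjustment before the add can work
    downsample = nn.Sequential(
        conv1x1(self.inplanes, planes * block.expansion, stride),
        nn.BatchNorm2d(planes * block.expansion),
    )

layers = []
# only the FIRST block of the chunk receives the stride and the downsample
layers.append(block(self.inplanes, planes, stride, downsample))
self.inplanes = planes * block.expansion
# every remaining block runs at stride 1 with matching channels
for _ in range(1, blocks):
    layers.append(block(self.inplanes, planes))
```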
First let's ignore the if statement and pretend it doesn't exist. If stride=2 is passed to a block, its behavior is to apply a stride-2 convolution to the input, do some extra processing (basic or bottleneck, depending on the block type), and then try to add the result back to the initial input. But if stride=2, the input no longer matches the shape of the residual branch's output, so the addition can't happen. So we need that extra downsample layer, provided by the if statement, applied to the input before adding it to the processed result of the residual branch. Once we have done this, we can perform all the other convolutions within that residual chunk before we get to the next chunk, which will probably start with a stride-2 convolution again.
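Inside the block itself, the downsample module (when present) is applied to the skip path right before the add, roughly like this. I'm paraphrasing BasicBlock.forward and Bottleneck.forward, collapsing the conv/BN stack into a made-up self.branch for brevity:

```python
def forward(self, x):
    identity = x
    out = self.branch(x)  # the residual branch: possibly a stride-2 conv, plus the rest
    if self.downsample is not None:
        identity = self.downsample(x)  # adapt the skip path so the shapes line up
    out += identity
    return self.relu(out)
```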
This all makes sense. But there are some loose ends. The main one: what is that part of the if statement checking, where self.inplanes doesn't match planes? The explanation is fairly easy. self.inplanes is just the number of channels being fed into the block. planes is more subtle. For a BasicBlock it represents the output channels of the block (and the intermediate channels, as they are the same). And since BasicBlock has an expansion value of 1, planes * block.expansion is just planes. So if the input channels are not the same as the output channels (self.inplanes != planes), we can't add the input back in, so we collect our downsample module from the conditional and apply it to the input before we add it to the result.
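As a concrete example, take the second chunk of resnet18 (planes=128, stride=2, entering with 64 channels). This isn't torchvision code, just the conditional evaluated by hand:

```python
# layer2 of resnet18: BasicBlock, so expansion == 1
inplanes, planes, stride, expansion = 64, 128, 2, 1

needs_downsample = stride != 1 or inplanes != planes * expansion
print(needs_downsample)  # True, and both halves of the test fire:
# the skip path must be strided (stride 2) *and* widened (64 -> 128 channels)
```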
But for Bottleneck blocks, this becomes tricky. For a Bottleneck, planes refers to the reduced number of channels in the middle of the block. It just happens that planes, as given by the values passed from the __init__ function, will always be one fourth of the block's output channels. We wouldn't know this unless we looked at the definition of the Bottleneck block and found that the expansion factor is always 4 and the output of the block is always 4 * planes. So for the first _make_layer chunk of resnet50, the first Bottleneck block will change the number of channels from 64 to 256, but stride will be equal to 1, so there will be no change in height and width. But remember:
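The conditional from _make_layer reads:

```python
if stride != 1 or self.inplanes != planes * block.expansion:
    downsample = ...  # the 1x1 conv + batch norm for the skip path
```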
So self.inplanes is equal to 64, and planes * block.expansion = 64 * 4 = 256, which are not equal. So we do indeed wind up making a downsample block, and this is what we want. But the unfortunate consequence is that the name of the layer no longer matches its function. The role of downsample is to be an adapter, not a downsampler: it can exist to make the channels consistent, the height and width consistent, or both.
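To make the "adapter" role concrete, here is roughly what that first downsample amounts to for layer1 of resnet50: a stride-1 1x1 convolution (plus batch norm) whose only job is to widen the skip path from 64 to 256 channels. The 56x56 below assumes a standard 224x224 input:

```python
import torch
import torch.nn as nn

# layer1 of resnet50: inplanes=64, planes=64, Bottleneck expansion=4, stride=1
downsample = nn.Sequential(
    nn.Conv2d(64, 64 * 4, kernel_size=1, stride=1, bias=False),
    nn.BatchNorm2d(64 * 4),
)

x = torch.randn(1, 64, 56, 56)  # what the stem produces for a 224x224 image
print(downsample(x).shape)      # torch.Size([1, 256, 56, 56])
# Same height and width, four times the channels: an adapter, not a downsampler.
```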
This is a flexible way to write code to generate ResNets, but it is very unclear what is happening unless you trace all possibilities within all networks. It leads to even more difficulty when reading code by programmers who are doing invasive network surgery, creating exotic networks with new dilation ratios and different strides.
But there is one more thing that makes the code somewhat weird. As it turns out, for every standard resnet from 18 through 152, whenever there is a change in height/width, there is also a change in channels. But not the other way around. This means that, for all the standard networks, the stride != 1 check is redundant in the conditional! That does not make it an unimportant part of the conditional, but it would only come into play if, for instance, the stride were changed on the first _make_layer group in resnet18 or resnet34, which use BasicBlocks.
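Here is a quick way to check that claim, walking the standard stage configurations by hand (not torchvision code, just the same numbers it uses):

```python
# Every standard resnet makes four _make_layer calls with these (planes, stride)
# pairs; expansion is 1 for BasicBlock and 4 for Bottleneck, and every network
# enters layer1 with 64 channels from the stem.
stages = [(64, 1), (128, 2), (256, 2), (512, 2)]

for name, expansion in [("resnet18/34", 1), ("resnet50/101/152", 4)]:
    inplanes = 64
    for planes, stride in stages:
        stride_fires = stride != 1
        channels_fire = inplanes != planes * expansion
        if stride_fires and not channels_fire:
            print(f"{name}: stride alone triggered a downsample")  # never happens
        inplanes = planes * expansion  # what _make_layer sets for the next chunk

# Nothing prints: whenever the stride check fires, the channel check fires too.
```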
Torchvision is awesome and there is a lot of cool stuff to look at, but it can be difficult to rummage through because it is fairly opaque. This is a big loss, because many important strides can be made with these networks by simply changing the behavior of a few lines of code.
So be sure to really understand how this ubiquitous code works, because it makes it much easier to change things for your own projects. Hopefully your network surgeries now have better outcomes!