Vision Transformer Coding in PyTorch, Part 1: Bird’s Eye View | by Borna Ahmadzadeh | April 2022

Photo by Justin Wilkens on Unsplash

In this two-part series, we’ll learn about the Vision Transformer (ViT), which is taking the computer vision world by storm, and code it, from scratch, in PyTorch. You can find the accompanying GitHub repository here.

Audience: The backbone of the Vision Transformer is the transformer, and although it too is written from scratch, it won’t be explained in much depth. Therefore, a solid understanding of transformers (especially multi-headed self-attention) is necessary; other than that, there are no prerequisites.

Without further ado, let’s get to coding!

Attention Is All You Need marked a pivotal moment in deep learning when it introduced the transformer, a network that outperformed more complex architectures such as the long short-term memory (LSTM) on many natural language processing (NLP) benchmarks while being considerably faster. Multi-headed self-attention (MHSA) is the powerhouse behind the transformer, and it allowed researchers to do away with recurrence and convolutions, which were previously ubiquitous in NLP models. Transformers have proven the value of MHSA and shown that, at least in NLP, attention really is all you need.

Attention has also been widely used in convolutional neural networks (CNNs). For example, squeeze-and-excitation (SE), an indispensable component of state-of-the-art CNNs like EfficientNetV2, is a form of attention that adaptively weighs the channels of a feature map. Efficient channel attention (ECA) builds on SE and seeks to deliver an equivalent accuracy boost with fewer parameters.

Many other algorithms could be cited, but they all share a few attributes: I) they are not intended to replace convolutions; on the contrary, they complement convolutions, and II) they do not employ MHSA, for two defensible reasons:

  1. Convolutions have inductive biases suitable for computer vision. Translation invariance, for example, ensures that the output of a CNN will not change dramatically if the input pixels are shifted one to the right, a desirable property because shifting an image does not change its semantics.
  2. Multi-headed self-attention is notorious for its O(n²) cost, which makes it completely impractical for high-resolution images if each individual pixel is treated as a token.

Therefore, one cannot naively eliminate convolutions or perform multi-headed self-attention on images. Does this mean that CNNs will forever be intertwined with vision applications and that MHSA will never break into the world of images?

Not quite. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale showed how to overcome these obstacles and obtain excellent image-classification results exclusively with transformers. Specifically, the authors leveraged Google’s huge proprietary dataset, JFT-300M, to compensate for the lack of image-specific inductive biases (in the paper’s words, “large scale training trumps inductive bias”) and divided images into patches to drastically reduce the number of tokens.

The former need not concern us here because our focus is the Vision Transformer itself, not the training regime. The tokenization process, however, is of the utmost importance: it is what made multi-headed self-attention feasible for vision, and it is also the only difference between an NLP transformer and a ViT.

First, we’ll take a step back and look at transformers in their natural habitat, natural language processing. A sentence is transformed into a set of D-dimensional vectors (i.e. tokens) that represent its words, and these are passed to a transformer. An embedding layer is usually applied to achieve this, and additional modules such as dropout can be inserted for better performance.
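In PyTorch, this embedding step can be sketched with `nn.Embedding`; the vocabulary size, embedding dimension, and token indices below are made-up values for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary and embedding dimension for illustration.
vocab_size, embed_dim = 10_000, 4
embedding = nn.Embedding(vocab_size, embed_dim)

# "Mad dog jumped" as made-up token indices.
sentence = torch.tensor([[31, 204, 87]])  # shape: (batch=1, seq_len=3)
tokens = embedding(sentence)              # shape: (1, 3, 4), one 4-D vector per word
print(tokens.shape)                       # torch.Size([1, 3, 4])
```

Each word index is mapped to a learnable 4-dimensional vector, matching the embedding dimension used in the diagram below.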

An illustration of how the sentence “Mad dog jumped” is tokenized with an embedding dimension of 4. Diagram by the author.

As mentioned, this recipe cannot be applied directly to images due to memory and time requirements. That is, it is not viable to treat each pixel as a token the way we do with words, because even a 256 × 256 image would have 65,536 pixels (for simplicity, images are assumed to be single-channel), orders of magnitude more than modern accelerators can handle.

Vision transformers solve this problem by dividing an image into non-overlapping square patches and treating each patch as a token. A 256 × 256 image with 16 × 16-pixel patches would therefore be partitioned into 256 (image height) / 16 (patch height) = 16 patches along its height axis and 256 (image width) / 16 (patch width) = 16 patches along its width axis, or only 16 × 16 = 256 tokens in total.
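One way to sketch this patching step in PyTorch is with `Tensor.unfold`, which carves out non-overlapping windows along a dimension; the single-channel 256 × 256 input matches the arithmetic above.

```python
import torch

image = torch.randn(1, 1, 256, 256)  # (batch, channels, height, width), single-channel
patch = 16

# unfold along height, then width, yields non-overlapping 16x16 windows:
# shape becomes (1, 1, 16, 16, 16, 16) = (B, C, n_rows, n_cols, patch_h, patch_w)
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)

# Flatten each patch into a 256-dimensional vector: 256 tokens of 256 pixels each.
patches = patches.reshape(1, -1, patch * patch)  # (1, 256, 256)
print(patches.shape)                             # torch.Size([1, 256, 256])
```

The 65,536 pixels collapse into just 256 tokens, making the sequence length manageable for self-attention.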

Each cell represents a pixel, the image is 4 × 4, and the patches are 2 × 2. In the all-blue square on the far left, if each pixel were a token, there would be 16 tokens. In the square to its right, however, all cells of the same color represent one token (for example, the top-left patch, [1, 5, 4, 8], is a token), for 4 tokens in total. Conceptually, a token and its associated values here parallel the NLP example above. Diagram by the author.

It would, however, be sub-optimal to feed these tokens directly to a transformer. In the diagram above, for instance, each token is 4-dimensional, but embedding the tokens in more dimensions could improve accuracy by increasing the model’s representational capacity. Therefore, the tokens are passed through a linear transformation (the same weights and biases for all tokens), after which they are treated just like their NLP counterparts.


Thus, the general pipeline of transformers in vision is almost indistinguishable from that of NLP, except for the tokenization process.

Two elements present in ViTs but omitted from the NLP discussion above are the class token and the positional encoding. As a reminder, the class token is a learnable token concatenated with the other tokens, whose final value is passed to a head to generate predictions (similar to mean pooling in CNNs),

and the positional encoding informs the model of each token’s position by adding a set of learnable parameters to it.
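Both pieces can be sketched as learnable `nn.Parameter` tensors; the 256-token, 768-dimensional sizes below are illustrative, and zero-initialization is just one reasonable choice.

```python
import torch
import torch.nn as nn

batch, num_tokens, embed_dim = 1, 256, 768

# Learnable class token and positional encoding (one position per token, plus one
# for the class token itself).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, embed_dim))

tokens = torch.randn(batch, num_tokens, embed_dim)
tokens = torch.cat([cls_token.expand(batch, -1, -1), tokens], dim=1)  # (1, 257, 768)
tokens = tokens + pos_embed  # positional information added element-wise
print(tokens.shape)          # torch.Size([1, 257, 768])
```

After training, the row of the output corresponding to the class token is what gets passed to the classification head.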

A multitude of concepts have been covered so far, so here is a summary. First, an image is divided into several square patches, which are flattened. The flattened patches are then run through a linear layer (one linear layer shared by all patches) for improved performance. Finally, the class token is concatenated, and the tokens are summed with learnable parameters acting as the positional encoding.
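The whole tokenization pipeline above can be sketched in one module. The `ViTTokenizer` name and default sizes are assumptions for illustration; the strided-convolution shortcut used here is a common, equivalent way of expressing “flatten patches, then apply one shared linear layer.”

```python
import torch
import torch.nn as nn

class ViTTokenizer(nn.Module):
    """Patchify -> shared linear projection -> class token -> positional encoding."""

    def __init__(self, image_size=256, patch_size=16, in_channels=1, embed_dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A convolution with kernel = stride = patch size is equivalent to
        # flattening non-overlapping patches and applying one shared linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):
        x = self.proj(x)                  # (B, embed_dim, 16, 16)
        x = x.flatten(2).transpose(1, 2)  # (B, 256, embed_dim): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # (B, 257, embed_dim)

tokenizer = ViTTokenizer()
out = tokenizer(torch.randn(2, 1, 256, 256))
print(out.shape)  # torch.Size([2, 257, 768])
```

The output of this module is what would be fed to a standard transformer encoder.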

That’s all there is to Vision Transformer, and you should now have a good understanding of the theory behind it!

In this article, we learned about the Vision Transformer, a slightly modified version of the NLP transformer for vision applications. The main difference between the two is the tokenization procedure: in natural language processing, individual words are treated as individual tokens, but the quadratic cost of multi-headed self-attention makes treating each pixel as a token impractical. Instead, square blocks of pixels act as tokens, greatly reducing the computational load.

In the next article, we will implement the Vision Transformer in PyTorch and discuss the subtleties associated with it.

Please, if you have any questions or comments feel free to post them in the comments below and, as always, thanks for reading!

