We study the learned visual representations of CNNs and ViTs: texture bias, how good representations are learned, the robustness of pretrained models, and finally the properties that emerge from trained ViTs.
A review of state-of-the-art vision-language models such as CLIP, DALL-E, ALIGN and SimVLM.
Implement a UNETR to perform 3D medical image segmentation on the BraTS dataset.
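If you'd rather not build the architecture from scratch, here is a hedged sketch that instantiates a UNETR via MONAI, one common implementation (the article may implement the model by hand). The channel counts follow the BraTS convention of 4 MRI modalities in and 3 tumor subregions out; the crop size is illustrative.

```python
import torch
from monai.networks.nets import UNETR

# 4 input channels: the four BraTS MRI modalities (T1, T1ce, T2, FLAIR).
# 3 output channels: the three nested tumor subregions scored in BraTS.
model = UNETR(
    in_channels=4,
    out_channels=3,
    img_size=(128, 128, 128),  # size of the cropped training volumes
    feature_size=16,
    hidden_size=768,
    mlp_dim=3072,
    num_heads=12,
)

x = torch.rand(1, 4, 128, 128, 128)  # one cropped 3D volume, batch size 1
logits = model(x)                    # (1, 3, 128, 128, 128) per-voxel logits
```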
Learn all there is to know about transformer architectures in computer vision, also known as Vision Transformers (ViTs).
Learn about the Hugging Face ecosystem with a hands-on tutorial on the datasets and transformers libraries. Explore how to fine-tune a Vision Transformer (ViT).
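To give a flavor of the recipe, here is a minimal sketch of the fine-tuning loop. The small "beans" dataset from the Hub stands in for whatever data you care about, and the hyperparameters are purely illustrative; only the checkpoint name is a real Hub model.

```python
import torch
from datasets import load_dataset
from transformers import (Trainer, TrainingArguments,
                          ViTForImageClassification, ViTImageProcessor)

ds = load_dataset("beans")  # small leaf-disease classification dataset
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

def transform(batch):
    # Resize and normalize PIL images into the pixel-value tensors ViT expects
    inputs = processor(batch["image"], return_tensors="pt")
    inputs["labels"] = batch["labels"]
    return inputs

prepared = ds.with_transform(transform)  # applied on the fly per batch

def collate(examples):
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=ds["train"].features["labels"].num_classes,
)

args = TrainingArguments(
    output_dir="./vit-beans",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    remove_unused_columns=False,  # keep the "image" column for the transform
)

Trainer(model=model, args=args, data_collator=collate,
        train_dataset=prepared["train"],
        eval_dataset=prepared["validation"]).train()
```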
Learn everything there is to know about the attention mechanisms of the famous transformer through 10+1 hidden insights and observations.
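As a taste of what the article unpacks, here is a minimal sketch of the scaled dot-product attention at the heart of the transformer; the shapes and the mask convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, tokens, dim); scores measure query-key similarity
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the keys
    return weights @ v                   # convex combination of value vectors
```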
Understand how positional embeddings emerged and how we use them inside self-attention to model highly structured data such as images.
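For a concrete picture, here is a minimal sketch of one common variant, learned absolute positional embeddings added to patch tokens; the dimensions are illustrative and the class token is omitted for brevity.

```python
import torch
import torch.nn as nn

class PatchEmbeddingWithPosition(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patch projection implemented as a strided conv
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # One learned vector per patch position, shared across the batch
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):                                 # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return tokens + self.pos_embed  # inject position into each token
```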
Learn about einsum notation and the einops library by coding a custom multi-head self-attention unit and a transformer block.
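In that spirit, here is a minimal sketch of a multi-head self-attention unit written with einsum and einops; the class name and default dimensions are illustrative.

```python
import torch
import torch.nn as nn
from einops import rearrange

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):             # x: (batch, tokens, dim)
        qkv = self.to_qkv(x)          # project to queries, keys, values at once
        q, k, v = tuple(rearrange(qkv, "b t (k h d) -> k b h t d",
                                  k=3, h=self.heads))
        # Similarity of every query i with every key j, per head
        attn = torch.einsum("b h i d, b h j d -> b h i j", q, k) * self.scale
        attn = attn.softmax(dim=-1)
        # Weighted sum of values, then merge the heads back together
        out = torch.einsum("b h i j, b h j d -> b h i d", attn, v)
        return self.to_out(rearrange(out, "b h t d -> b t (h d)"))
```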
In this article, you will learn how the vision transformer works for image classification problems. We distill all the important details you need to grasp, along with the reasons it can work very well given enough data for pretraining.
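To give a flavor, here is the patchification step at the core of ViT, sketched with einops: the image becomes a sequence of flattened patches, which a linear layer then projects to the model dimension. The sizes below are the usual ViT-Base defaults, used only for illustration.

```python
import torch
from einops import rearrange

img = torch.rand(1, 3, 224, 224)  # a batch with one RGB image
# Split into 16x16 patches and flatten each into a vector: the "tokens"
patches = rearrange(img, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=16, p2=16)
print(patches.shape)  # torch.Size([1, 196, 768]): 14*14 patches of 16*16*3 values
```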
An intuitive understanding of Transformers and how they are used in Machine Translation. After analyzing all the subcomponents one by one, such as self-attention and positional encodings, we explain the principles behind the Encoder and Decoder and why Transformers work so well.
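For a quick hands-on feel, PyTorch bundles the full encoder-decoder stack as nn.Transformer. The sketch below assumes inputs that have already been embedded and position-encoded, with illustrative shapes.

```python
import torch
import torch.nn as nn

# Sequence-first layout: (sequence_length, batch, d_model)
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)
src = torch.rand(10, 32, 512)  # 10 source tokens, already embedded
tgt = torch.rand(20, 32, 512)  # 20 target tokens fed to the decoder
out = model(src, tgt)          # (20, 32, 512): one vector per target position
```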
New to Natural Language Processing? This is the ultimate beginner’s guide to the attention mechanism and sequence learning to get you started.