ML/Data science blogs

From Masked Picture Modeling to Autoregressive Picture Modeling

June 16, 2024

A short evaluate of the picture basis mannequin pre-training aims

We crave giant fashions, don’t we?

The GPT collection has proved its capacity to revolutionize the NLP world, and everybody is worked up to see the identical transformation within the pc imaginative and prescient area. The preferred picture basis fashions in recent times embody SegmentAnything, DINOv2, and lots of others. The pure query is, what are the important thing variations between the pre-training stage of those basis fashions?

As an alternative of answering this query instantly, we are going to gently evaluate the picture basis mannequin pre-training aims utilizing Masked Picture Modeling on this weblog article. We will even talk about a paper (to be) printed in ICML’24, making use of the Autoregression Modeling to basis mannequin pre-training.

Picture supply: https://pxhere.com/en/photograph/1025277

What’s mannequin pre-training in LLM?

Mannequin pre-training is a terminology used basically giant fashions (LLM, picture basis fashions) to explain the stage the place no label is given to the mannequin, however coaching the mannequin purely utilizing a self-supervised method.

Widespread pre-training strategies principally originated from LLMs. For instance, the BERT mannequin used Masked Language Modeling, which impressed Masked Picture Modeling corresponding to BEiT, MAE-ViT, and SimMM. The GPT collection used Autoregressive Language Modeling, and a not too long ago accepted ICML publication prolonged this concept to Autoregressive Picture Modeling.

So, what are Masked Language Modeling and Autoregressive Language Modeling?

The Masked Language Modeling was first proposed within the BERT paper in 2018. The strategy was described as “merely masking some proportion of the enter tokens randomly after which predicting these masked tokens.” It’s a bi-directional illustration strategy, because the mannequin will attempt to predict forwards and backwards on the masked token.

1*WjBXQ6OD Jd0haPOPqZ cQ — Masked LM pre-training. Picture supply: https://arxiv.org/abs/1810.04805

The Autoregressive Language Modeling was famously recognized from the GPT3 paper. It has a clearer definition within the XLNet paper as follows, and we are able to see the mannequin is unidirectional. The explanation the GPT collection makes use of a unidirectional language mannequin is that the structure is decoder-based, which solely wants self-attention on the immediate and the completion:

1*mwrCHerj Gy rZNRNI9BEQ — AR — Autoregression. Supply: https://arxiv.org/abs/1906.08237

Pre-training in Picture Area

When transferring into the picture area, the quick query is how we type the picture “token sequence.” The pure considering is simply to make use of the ViT structure, breaking a picture right into a grid of picture patches (visible tokens).

BEiT. Printed as an arXiv preprint in 2022, the thought of BEiT is easy. After tokenizing a picture right into a sequence of 14*14 visible tokens, 40% of the tokens are randomly masked, changed by learnable embeddings, and fed into the transformer. The pre-training goal is to maximise the log-likelihood of the proper visible tokens, and no decoder is required for this stage. The pipeline is proven within the determine under.

1*D8C5QWcJPLr 5BXiM9IXCA — BEiT pre-training pipeline. Picture supply: https://arxiv.org/abs/2106.08254

Within the authentic paper, the authors additionally supplied a theoretical hyperlink between the BEiT and the Variational Autoencoder. So the pure query is, can an Autoencoder be used for pre-training functions?

MAE-ViT. This paper answered the query above by designing a masked autoencoder structure. Utilizing the identical ViT formulation and random masking, the authors proposed to “discard” the masked patches throughout coaching and solely use unmasked patches within the visible token sequence as enter to the encoder. The masks tokens will probably be used for reconstruction throughout the decoding stage on the pre-training. The decoder could possibly be versatile, starting from 1–12 transformer blocks with dimensionality between 128 and 1024. Extra detailed architectural info could possibly be discovered within the authentic paper.

Masked Autoencoder structure. Picture supply: https://arxiv.org/abs/2111.06377

SimMIM. Barely completely different from BEiT and MAE-ViT, the paper proposes utilizing a versatile spine corresponding to Swin Transformer for encoding functions. The proposed prediction head is extraordinarily light-weight—a single linear layer of a 2-layer MLP to regress the masked pixels.

1*StrSZHV79VcQfzHqk7MkeQ — SimMIM pipeline. Picture supply: https://arxiv.org/abs/2111.09886

AIM. A latest paper accepted by ICML’24 proposed utilizing the Autoregressive mannequin (or causal mannequin) for pre-training functions. As an alternative of utilizing a masked sequence, the mannequin takes the total sequence to a causal transformer, utilizing prefixed self-attention with causal masks.

1*N0VFQpnzi3hZ9Yir9SyO9w — AIM with Causal Transformer. Picture supply: https://arxiv.org/abs/2401.08541

What’s prefixed causal consideration? There are detailed tutorials on causal consideration masking on Kaggle, and right here, it’s masking out “future” tokens on self-attention. Nevertheless, on this paper, the authors declare that the discrepancy between the causal masks and downstream bidirectional self-attention would result in a efficiency challenge. The answer is to make use of partial causal masking or prefixed causal consideration. Within the prefix sequence, bidirectional self-attention is used, and causal consideration is utilized for the remainder of the sequence.

1*BPbeDSWTn0uU4 RIsFRwrw — Causal consideration throughout pre-training. Picture supply: https://arxiv.org/abs/2401.08541

What’s the benefit of Autoregressive Picture Masking? The reply lies within the scaling, of each the mannequin and knowledge sizes. The paper claims that the mannequin scale instantly correlates with the pre-training loss and the downstream activity efficiency (the next left subplot). The uncurated pre-training knowledge scale can be instantly linked to the downstream activity efficiency (the next proper subplot).

1*oaorrYzEFDsY5qeNvHpz A — Scaling impact of the AIM. Picture supply: https://arxiv.org/abs/2401.08541

In comparison with a 50% masking ratio, the AIM achieved an astonishing 8% efficiency enhance over Masked Picture Modeling.

1*taUMFBO pA0cAswK luKvA — Desk supply: https://arxiv.org/abs/2401.08541

So, what’s the large takeaway right here? The AIM paper mentioned completely different trade-offs between the state-of-the-art pre-training strategies, and we received’t repeat them right here. A shallower however extra intuitive lesson is that there’s possible nonetheless a lot work left to enhance the imaginative and prescient basis fashions utilizing present expertise from the LLM area, particularly on scalability. Hopefully, we’ll see these enhancements within the coming years.

References

El-Nouby et al., Scalable Pre-training of Giant Autoregressive Picture Fashions. ICML 2024. Github: https://github.com/apple/ml-aim
Xie et al., SimMIM: a Easy Framework for Masked Picture Modeling. CVPR 2022. Github: https://github.com/microsoft/SimMIM
Bao et al., BEiT: BERT Pre-Coaching of Picture Transformers. arXiv preprint 2022. Github: https://github.com/microsoft/unilm/tree/grasp/beit
He et al., Masked autoencoders are scalable imaginative and prescient learners. CVPR 2022. HuggingFace Official: https://huggingface.co/docs/transformers/en/model_doc/vit_mae
Caron et al., Rising Properties in Self-Supervised Imaginative and prescient Transformers. ICCV 2021. Github: https://github.com/facebookresearch/dino?tab=readme-ov-file
Liu et al., Swin transformer: Hierarchical imaginative and prescient transformer utilizing shifted home windows. ICCV 2021. Github: https://github.com/microsoft/Swin-Transformer
Brown et al., Language Fashions are Few-Shot Learners. NeurIPS 2020. Github: https://github.com/openai/gpt-3
Yang et al., Xlnet: Generalized autoregressive pretraining for language understanding. NeurIPS 2019. Github: https://github.com/zihangdai/xlnet
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint 2018. HuggingFace Official: https://huggingface.co/docs/transformers/en/model_doc/bert

From Masked Picture Modeling to Autoregressive Picture Modeling was initially printed in In the direction of Information Science on Medium, the place individuals are persevering with the dialog by highlighting and responding to this story.

Supply hyperlink

A short evaluate of the picture basis mannequin pre-training aims

LEAVE A REPLY Cancel reply