Deep Learning for NLP - Part 6

Part 6: Popular Transformer Models

What You Will Learn!

Deep Learning for Natural Language Processing
Popular Transformer encoder and decoder models
Multi-modal Transformer models
Large-scale Transformer models
DL for NLP

Description

This course is part of the "Deep Learning for NLP" series. In this course, I will talk about various popular Transformer models beyond the ones I have already covered in the previous sessions in this series. These Transformer models include encoder-based as well as decoder-based models, and they differ in aspects such as the form of input, pretraining objectives, pretraining data, and architecture variations.

These Transformer models were all proposed in 2019 or later, and some of them are from early 2021. Thus, as of Aug 2021, these models are very recent and state of the art across multiple NLP tasks.

The course consists of three main sections as follows.

In the first section, I will talk about a few Transformer encoder and decoder models which extend the original Transformer framework. Specifically, I will cover SpanBERT, ELECTRA, DeBERTa and DialoGPT. SpanBERT, ELECTRA and DeBERTa are Transformer encoders, while DialoGPT is a Transformer decoder model. For each model, we will talk about how its architecture or pretraining differs from the standard Transformer. We will also talk about important results on various NLP tasks.

In the second section, I will talk about multi-modal Transformer models. Multi-modal learning has gained a lot of momentum in recent years, which created a need for Transformer models that can handle text and image data together. In this part, I will cover VisualBERT and ViLBERT, both of which process multi-modal input very effectively. The two models have many similarities. We will discuss their similarities and differences in detail.

Lastly, in the third section, I will talk about large-scale Transformer models. I will introduce the mixture of experts (MoE) architecture. Then I will talk about how GShard adapts the MoE architecture and shows great results on massive multilingual machine translation. Finally, I will discuss Switch Transformers, which simplify the MoE routing algorithm and also make several engineering optimizations to reduce network communication and computation costs and to mitigate instabilities.
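
To give a flavour of the simplified routing idea, here is a minimal sketch of top-1 ("switch"-style) routing, where each token is sent to only its single highest-scoring expert. This is an illustration under stated assumptions, not the code from the Switch Transformers paper: the function names `softmax` and `top1_route`, the plain-list data layout, and the simple "drop on overflow" capacity rule are all hypothetical choices made for this example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_route(router_logits, capacity):
    """Assign each token to its single highest-probability expert.

    router_logits: one list per token, holding one logit per expert.
    capacity: assumed maximum number of tokens an expert accepts.
    Returns (assignments, load), where assignments[i] is
    (expert_index, gate_probability) for token i, or (None, 0.0)
    if the chosen expert was already full.
    """
    num_experts = len(router_logits[0])
    load = [0] * num_experts
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        # Top-1 choice: pick only the best expert for this token.
        expert = max(range(num_experts), key=lambda i: probs[i])
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append((expert, probs[expert]))
        else:
            # Overflow: the token skips the expert layer in this sketch.
            assignments.append((None, 0.0))
    return assignments, load

# Four tokens, two experts, at most two tokens per expert.
logits = [[2.0, 0.5], [1.5, 0.1], [0.2, 3.0], [2.2, 0.3]]
assignments, load = top1_route(logits, capacity=2)
print([expert for expert, _ in assignments])
print(load)
```

In this toy run the fourth token prefers the first expert, which is already full, so it is left unassigned. Limiting each expert's capacity like this is one way to keep computation per expert bounded, at the cost of some tokens bypassing the expert layer.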

In general, each of these papers is pretty long, which makes them difficult and time-consuming to understand. In these sessions, I have tried to summarize them, bringing out the intuitions and tying the important concepts across the papers into a coherent story. I hope you will find this useful for your work and understanding.