A brief introduction to Transformers

From the course: AI Text Summarization with Hugging Face

Now, transformers are an advanced and complex neural network architecture used to great effect in natural language processing tasks. Here in this movie, we'll briefly understand how transformers work. In the previous movie, we understood the role of attention in sequence-to-sequence models, and transformers use attention, or self-attention, to detect how different parts of the input sequence are related to one another, even when the relationship is very subtle.

The basic transformer architecture is made up of two components: the encoder and the decoder. Now, this seems to imply that transformers are sequence-to-sequence models, but that's not necessarily the case. Sequence-to-sequence models use both parts of the transformer architecture, the encoder as well as the decoder. As we've discussed earlier, text summarization is performed using sequence-to-sequence models.

The encoder in the transformer architecture receives an input and builds a representation of it, and the entire encoder is optimized to understand what the input is about. There are encoder models that use only the encoder part of the transformer architecture. Encoder models are best suited for tasks requiring an understanding of the full input sentence, such as sentence classification, named entity recognition, and extractive question answering.

The decoder portion of the transformer architecture uses the encoder's representation to generate a target sequence, and the decoder component is optimized for generating outputs. Decoder models use only the decoder component of a transformer. These models are best suited for tasks that involve text generation.

The original transformer architecture was first introduced by Google researchers in a paper called "Attention Is All You Need", and this is what the transformer architecture looked like. Any powerful NLP model you talk about today is likely using a transformer architecture. In the demos that follow, we'll be using a few different models for abstractive text summarization. All of those models are transformers. The transformer architecture comprises an encoder and a decoder, and both use attention. There are attention blocks in both the encoder as well as the decoder so that the model can focus on the relevant portions of the text.

Another thing to know is that transformers are very large models with millions of parameters, which means it's very expensive to train them, expensive in terms of time and expensive in terms of resources. Training transformer architectures has a huge environmental impact. Given that not everyone has access to these resources, it's very unlikely that you'll train a transformer model from scratch.

So here is what training a transformer from scratch looks like. Training from scratch is referred to as pre-training. You start with a basic model architecture, feed in a large corpus of data, spend a lot on compute and many days of training, and you get a pre-trained language model. You may not have the time, data, or resources to train a transformer model from scratch, but you can definitely use a pre-trained model and fine-tune it on your data. You start with the weights and architecture of a model that has already been pre-trained, and then you perform additional training, that is, fine-tuning, with a dataset that is specific to your task. Fine-tuning a model is an example of transfer learning.
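Before we get to fine-tuning, it's worth seeing how little code it takes to use a pre-trained transformer as-is. Here is a minimal sketch using the Hugging Face pipeline API; the checkpoint name is just an example, and any summarization model from the Hugging Face Hub would work in its place:

```python
# A minimal sketch: summarizing text with a pre-trained transformer.
# The checkpoint name is an example; any summarization model from the
# Hugging Face Hub could be substituted here.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = (
    "Transformers use self-attention to relate different parts of an "
    "input sequence to one another. The encoder builds a representation "
    "of the input, and the decoder uses that representation to generate "
    "a target sequence, which is exactly what text summarization needs."
)

summary = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```

Under the hood, this downloads pre-trained encoder-decoder weights and runs the full sequence-to-sequence loop we just described, with no training on your part.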
Fine-tuning leverages what the original pre-trained model has already learned, and that learning is transferred to your new model, which is trained further on a dataset that is more relevant to your task.
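To make that concrete, here is a rough sketch of what fine-tuning a pre-trained sequence-to-sequence transformer can look like with the Hugging Face Trainer API. The checkpoint, the toy dataset, and the hyperparameters are all illustrative assumptions, not the specific models or settings used later in this course:

```python
# A rough sketch of fine-tuning a pre-trained seq2seq transformer on
# task-specific data, i.e., transfer learning. The checkpoint, toy
# dataset, and hyperparameters are illustrative, not recommendations.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "t5-small"  # any pre-trained seq2seq checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)  # pre-trained weights

# A tiny toy dataset standing in for your own (document, summary) pairs.
raw = Dataset.from_dict({
    "document": [
        "The encoder builds a representation of the input sequence.",
        "The decoder generates the target sequence from that representation.",
    ],
    "summary": [
        "Encoders understand inputs.",
        "Decoders generate outputs.",
    ],
})

def preprocess(batch):
    # T5 expects a task prefix; tokenize documents and target summaries.
    inputs = ["summarize: " + doc for doc in batch["document"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=32, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="finetuned-summarizer",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # additional training on your data = fine-tuning
```

Notice that the expensive pre-training step is skipped entirely: you start from weights that already encode a lot of language understanding and only pay for the short, task-specific training run.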
