Transformers in Hugging Face - Hugging Face Tutorial

From the course: AI Text Summarization with Hugging Face

When we discussed Hugging Face earlier, we mentioned that Hugging Face is best known for developing and maintaining the Hugging Face Transformers library, an open-source NLP library built on top of PyTorch and TensorFlow. In the demos that follow, we'll be using models from this library for abstractive summarization. The Hugging Face Transformers library makes it very easy to download and use a state-of-the-art NLP model for inference. Hugging Face gives you access to these models through a simple Python API that you can pip install on your machine. Models on Hugging Face are generally PyTorch or TensorFlow models, so you're working with frameworks that you're likely familiar with. And you will see in just a bit how simple and straightforward it is for you to use models hosted on Hugging Face.

When you work with NLP models on Hugging Face, you'll be instantiating and using a Hugging Face pipeline. A pipeline is a high-level interface provided by the Hugging Face Transformers library that allows users to easily perform various NLP tasks using pre-trained models. Pipelines provide a simple and convenient way to use these models without the need for extensive coding or deep knowledge of model architectures. A pipeline encapsulates all of the steps needed to work with a model. A tokenizer preprocesses and prepares your input data so that it's in a form that can be fed into the model. Feeding the data into the model gives you predictions. These predictions are then passed to a post-processing stage, which gives you the output in a format that you can consume and understand. Each stage in this pipeline is individually accessible, which means you can access, use, and tweak these individual stages based on your use case.

Let's say you decide to access and use a classification model from the Hugging Face library. Here is what the output of the different stages might look like. You start with raw text that is tokenized and converted to input IDs, which are then fed into the model. The model outputs logits, which are raw, unnormalized scores for each class or category the model can predict. These logits are then post-processed, and you get a prediction.

The first model that we'll use for text summarization is the T5 Small model. T5 is the Text-to-Text Transfer Transformer. It's called a text-to-text transformer model because both the input and the output of the model are strings, no matter what NLP task you're performing. And the cool thing with the T5 model is that it can be used for a variety of different natural language processing tasks. All NLP tasks are reframed into a unified text-to-text format, so you can use the same model, same loss function, and same hyperparameters on any NLP task, including machine translation, document summarization, question answering, and even classification.

The dataset used to pre-train this model is the C4 dataset. Now, there are some standard text datasets used for pre-training models, but they all have their own strengths and weaknesses. For example, text from Wikipedia is high quality but uniform in style, and it's a relatively small dataset, whereas Common Crawl, which scrapes the web, is an enormous and very diverse dataset, but the quality is not that great. C4 stands for Colossal Clean Crawled Corpus. It's a cleaned version of Common Crawl that's two orders of magnitude larger than Wikipedia, and that is what has been used to pre-train the T5 model.
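To make the classification example above concrete, here is a minimal sketch of those three stages written out by hand, assuming the transformers and torch packages are installed. The DistilBERT sentiment checkpoint named in the code is just one example of a classification model hosted on Hugging Face, not one of the models used in this course.

# A minimal sketch of the individual pipeline stages, assuming the
# `transformers` and `torch` packages are installed. The checkpoint below
# is just one example of a classification model hosted on Hugging Face.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Stage 1 -- tokenizer: raw text is converted to input IDs
inputs = tokenizer("Hugging Face pipelines are easy to use.", return_tensors="pt")

# Stage 2 -- model: input IDs go in, logits come out
with torch.no_grad():
    logits = model(**inputs).logits

# Stage 3 -- post-processing: logits -> probabilities -> predicted label
probs = torch.softmax(logits, dim=-1)
predicted_label = model.config.id2label[int(probs.argmax(dim=-1))]
print(predicted_label)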
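And here is roughly what the high-level pipeline API looks like for summarization with the T5 Small model. This is a minimal sketch, assuming transformers (with PyTorch or TensorFlow) is pip installed; the sample article and the generation parameters are illustrative, not prescribed by the course.

# A minimal sketch of abstractive summarization with the pipeline API,
# assuming `transformers` (with PyTorch or TensorFlow) is pip installed.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")

article = (
    "Hugging Face maintains the Transformers library, an open-source NLP "
    "library built on top of PyTorch and TensorFlow. It provides pre-trained "
    "models for tasks such as translation, question answering, and "
    "summarization through a simple, high-level pipeline interface."
)

# The generation parameters here are illustrative, not required.
result = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])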
The second model that we'll use from the Hugging Face Transformers library is a Pegasus model that has been fine-tuned on the CNN Daily Mail dataset. The Pegasus model is a sequence-to-sequence model that uses the encoder-decoder architecture we've discussed here, and the pre-training task that was used to train this model is very similar to summarization. Pegasus is primarily a summarization model. The Pegasus model that we'll use has been pre-trained using two different techniques. The first is MLM, which stands for Masked Language Modeling. This is where encoder input tokens are randomly replaced by a mask token, and the encoder has to predict these masked tokens. It has also been pre-trained using GSG, or Gap Sentence Generation, where whole sentences in the encoder input are replaced by a second mask token, and the decoder has to generate those masked sentences as the output sequence.

And finally, the last transformer model that we'll use for summarization is BART Large CNN, a BART model fine-tuned on the CNN Daily Mail dataset. Once again, this is a sequence-to-sequence model that uses the encoder-decoder architecture. It has been pre-trained on the English language and fine-tuned on the CNN Daily Mail dataset, so when we use it with news articles, you'll see that it performs very well. This model works well for both summarization and translation tasks.
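As a preview of the demos, here is a minimal sketch of how these two checkpoints can be loaded with the same pipeline API, assuming transformers is installed. The identifiers google/pegasus-cnn_dailymail and facebook/bart-large-cnn are Hugging Face Hub checkpoints of the kind described here, and the short article text is just a placeholder.

# A minimal sketch of summarization with Pegasus and BART checkpoints
# fine-tuned on CNN Daily Mail, assuming `transformers` is installed.
from transformers import pipeline

# Placeholder text; these models are fine-tuned on news articles.
article = (
    "The city council voted on Tuesday to approve a new transit plan that "
    "expands bus service to the outer neighborhoods. Officials said the "
    "changes will take effect early next year and are expected to reduce "
    "average commute times across the city."
)

pegasus = pipeline("summarization", model="google/pegasus-cnn_dailymail")
print(pegasus(article, max_length=60, min_length=10)[0]["summary_text"])

bart = pipeline("summarization", model="facebook/bart-large-cnn")
print(bart(article, max_length=60, min_length=10)[0]["summary_text"])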
