From the course: AI Text Summarization with Hugging Face
Generating summaries using Pegasus - Hugging Face Tutorial
We are now ready to access and use the Google Pegasus model for text summarization. For that, we need two classes from the Hugging Face Transformers library: PegasusForConditionalGeneration, which gives us access to the pre-trained model, and PegasusTokenizer, which gives us access to the model's tokenizer. The path to the Pegasus model here on Hugging Face is google/pegasus-cnn_dailymail.

Pegasus models are sequence-to-sequence models with an encoder-decoder architecture. The pre-training task used for Pegasus is similar to summarization: important sentences are removed or masked from an input document and are generated together as one output sequence from the remaining sentences. So this task is similar to producing an extractive summary.

On line 3, we access the tokenizer for the Pegasus model by calling PegasusTokenizer.from_pretrained and passing in the model path. In a similar way, on line 4, we access the model itself using PegasusForConditionalGeneration.from_pretrained. This will download the model and the tokenizer and make them available here in your Colab notebook. Based on the information on the model card for this particular Pegasus model, it was pre-trained on the C4 dataset as well as a dataset called HugeNews. The HugeNews dataset includes the CNN/Daily Mail dataset, which we used in the previous demo, which is why we are working with a different dataset in this demo, the BBC Summaries dataset.

If you print out the contents of this tokenizer object, you'll get an overview of what the tokenizer looks like: the special tokens it uses and the maximum input length of the model, which is 1024 tokens. The vocabulary size of this tokenizer is 96,103. Now let's take a look at the model itself by simply printing out its contents. This is a PyTorch model and, again, a heads-up that we'll be using PyTorch to get predictions from this model. You get an overview of the layers in the model.
You can see there is a shared embedding layer and then an encoder, and if you scroll down, you'll be able to see the Pegasus decoder.

We are now ready to put our Pegasus model through its paces. Let's get it to summarize a single article: the article at index 5. I access this article from the cleaned news summary dataset and store it in the example_text variable. This seems to be an article about movies from the year 2004.

All right, we have our article; let's instantiate our Pegasus pipeline. On line 3, we instantiate the pipeline object and specify summarization as the task we want to perform; the model is the Pegasus model, and truncation=True, so if our input has more tokens than the maximum length supported by the model, it will be truncated. On line 4, I invoke the summarizer on the example text to get the summary. Notice that I do not need to specify a prefix before the example text, because this particular model is meant for summarization and does not perform other tasks.

Let's take a look at the summary this produces. You may see a warning here saying that some weights were not initialized from the model checkpoint. This means that if you're using this model for a production task, you should probably fine-tune it on your dataset first. I tried fine-tuning this model using the resources available on Colab, the GPU that is made available to us, and I found it very hard: even a single epoch took several hours. You're familiar with how to fine-tune a model from the previous demo, so if you have more resources available to you, I suggest you try fine-tuning this model on a dataset of your choice, maybe CNN/Daily Mail.

Here is the summary generated by this model. It seems to be a decent summary, but we'll only know once we compute the ROUGE score. Let's access the reference text, that is, the summary available along with the dataset itself.
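The pipeline setup described above might look like the sketch below. The article text and variable names are placeholders (the actual article at index 5 comes from the notebook's dataset, which isn't reproduced here):

```python
from transformers import pipeline

# Summarization pipeline around Pegasus; truncation=True clips any
# input longer than the model's 1024-token limit
summarizer = pipeline(
    "summarization",
    model="google/pegasus-cnn_dailymail",
    truncation=True,
)

# Placeholder standing in for the article at index 5 of the dataset
example_text = (
    "The 2004 movie season featured a number of high-profile releases. "
    "Studios reported strong box-office returns, and critics debated "
    "which films would be remembered as the year's best."
)

# No task prefix is needed: this model only does summarization
summary = summarizer(example_text)
print(summary[0]["summary_text"])
```

The pipeline returns a list with one dict per input, each holding a "summary_text" key.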
This is the reference text, and we have the candidate text, that is, the summary generated by the model. Let's use ROUGE to evaluate the summary. You'll need to load the ROUGE metric into this Colab notebook. I call rouge.compute as we've done before, passing in the prediction from the model and the reference text. Let's take a look at the result, and you can see that the ROUGE scores are pretty decent: 0.46 for rouge1, 0.38 for rouge2, and 0.25 for rougeL and rougeLsum. So in spite of the model weights not being perfectly initialized, the summary that this model generated was a fairly decent one.
Contents
- Accessing the BBC dataset on Google Drive (3m 34s)
- Instantiating and cleaning the BBC News summaries dataset (3m 48s)
- Generating summaries using Pegasus (4m 55s)
- Generating multiple summaries and computing aggregate ROUGE scores (2m 49s)
- Generating summaries using BART (3m 19s)
- Computing ROUGE metrics for a set of summaries (2m 9s)