Accessing the BBC dataset on Google Drive
From the course: AI Text Summarization with Hugging Face
In this demo, we'll continue working with Hugging Face transformer models and try two other models for text summarization. The first model we'll use is Google's Pegasus model, which has been trained on the C4 dataset as well as a news dataset that includes the CNN/Daily Mail articles. We'll have these models generate summaries using data from a different dataset, the BBC News Summary dataset. Let's get started.

Because we're starting with a new Colab notebook, we first need to pip install all of the libraries we need. Once again, I'm using the GPU runtime in this notebook. We also need to pip install a second library, sentencepiece, because the Pegasus tokenizer depends on it. SentencePiece is a subword tokenization method: it segments text into tokens drawn from a vocabulary learned directly from a raw text corpus. Once you have the required libraries installed, you may need to restart your kernel. That's straightforward: here in Colab, go to Runtime > Restart runtime, and after the restart you should have all of the libraries you need available.

As before, let's connect to our account on the Hugging Face Hub using notebook_login. notebook_login needs the access token we generated earlier; as you'll remember, these access tokens live in your profile settings. We likely need both read and write permissions, so I'm going to copy over the token summarization write, paste it into the input box, and use it to log in and connect to the Hugging Face Hub.

I've uploaded the dataset I plan to use for summarization to Google Drive, the Google Drive associated with the same account I used to log in to Colab. drive.mount("/content/drive") will mount my Google Drive root folder in Colab, which allows us to access the dataset stored there. You'll be asked to log in with your account and authenticate yourself, and to grant Colab permission to access Google Drive. Click "Allow", and with that, your Drive folder should be mounted.

Next, let's confirm that the dataset is indeed present on Google Drive. Under My Drive > BBC, you can see the BBCNewsSummary.zip file. We'll need to unzip and extract the contents of this file before we can use it. With the path to the zip file on my Drive in hand, I'm going to import Python's zipfile library and use the zip_ref.extractall method to extract the contents of the zip. The extracted contents will be placed in the My Drive > BBC folder. Once the extraction operation is complete, you should find a new subfolder in here called BBC News Summary, and if you click through, you should find subfolders for the news articles and the corresponding summaries.

Let's take a look at the news articles. You can see that they're split into different topics or categories; let's look at one of these. Each article is a separate text file, numbered 001, 002, and so on. The correspondingly numbered text files in the summaries folder contain the summaries for those articles. We'll use those summaries as reference summaries to compute ROUGE scores for the summaries we get from the models we've chosen. The sketches below show what each of these steps looks like in code.
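First, the installs. This is a minimal sketch: the transformers and sentencepiece installs come straight from the demo, while datasets, evaluate, and rouge_score are my assumption for the ROUGE scoring used later in this chapter.

```python
# Install the core libraries for this demo. The Pegasus tokenizer
# needs sentencepiece, so install it alongside transformers.
!pip install transformers sentencepiece

# Assumption: later videos compute ROUGE scores, which typically
# uses the datasets/evaluate/rouge_score stack.
!pip install datasets evaluate rouge_score
```

Remember to restart the runtime (Runtime > Restart runtime) after the installs so the new packages are picked up.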
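Connecting to the Hugging Face Hub from the notebook looks like this; notebook_login opens an input box where you paste the access token from your profile.

```python
from huggingface_hub import notebook_login

# Opens an input box in the notebook; paste the access token
# (with write permission) generated on your Hugging Face profile.
notebook_login()
```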
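Mounting Google Drive in Colab is a single call:

```python
from google.colab import drive

# Mounts your Google Drive root under /content/drive. Colab will
# prompt you to authenticate and grant it access to your Drive.
drive.mount("/content/drive")
```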
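Extracting the archive with Python's zipfile library might look like the following; the exact path is an assumption based on the My Drive > BBC location shown in the demo.

```python
import zipfile

# Path to the uploaded archive; adjust if you stored the zip
# elsewhere in your Drive (My Drive maps to /content/drive/MyDrive).
zip_path = "/content/drive/MyDrive/BBC/BBCNewsSummary.zip"

with zipfile.ZipFile(zip_path, "r") as zip_ref:
    # Extract everything into the BBC folder; this creates the
    # "BBC News Summary" subfolder with articles and summaries.
    zip_ref.extractall("/content/drive/MyDrive/BBC")
```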
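Finally, a quick sketch of reading one article together with its numbered reference summary. The subfolder names ("News Articles", "Summaries") and the "business" category are assumptions based on the layout described above; adjust them to match what you see in your Drive.

```python
from pathlib import Path

base = Path("/content/drive/MyDrive/BBC/BBC News Summary")

# Each category has numbered text files; the file with the same
# number under the summaries folder is the reference summary.
article = (base / "News Articles" / "business" / "001.txt").read_text(errors="ignore")
summary = (base / "Summaries" / "business" / "001.txt").read_text(errors="ignore")

print(article[:300])
print("---")
print(summary[:300])
```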
Contents
- Accessing the BBC dataset on Google Drive (3m 34s)
- Instantiating and cleaning the BBC News summaries dataset (3m 48s)
- Generating summaries using Pegasus (4m 55s)
- Generating multiple summaries and computing aggregate ROUGE scores (2m 49s)
- Generating summaries using BART (3m 19s)
- Computing ROUGE metrics for a set of summaries (2m 9s)