From the course: AI Text Summarization with Hugging Face

Instantiating and cleaning the BBC News summaries dataset

The data that we plan to use for summarization is now available on Google Drive. However, we need to create a dataset object that we can feed into our summarization model, and that's what we'll do here in this movie.

I've defined a function called extract, which takes as its input argument the path to a single file, whether that file is an original article or a summary. On line 4, I've defined a regular expression pattern that uses this path to extract the category the file belongs to and the ID of the text file. The ID of the text file is simply its name: 001.txt, and so on. We do a regular expression search on line 6 to extract the category and file ID, then we open the file at that path, read its contents into the text variable, and return the category, file ID, and text.

Next, we perform a set of operations to get the articles and the summaries into a single DataFrame. The dataset path points to the root folder where the articles and the summaries are present; BBC News Summary is the name of that folder. On lines 6 and 7, we extract articles_data and summaries_data as lists, and on lines 9 and 10, we convert those lists to DataFrames. The articles DataFrame has three columns: category, ID, and article. The summaries DataFrame has category, ID, and summary. On line 12, I perform a join using the Pandas merge function, combining these two DataFrames into a single DataFrame on the category and ID columns. And this is what the joined DataFrame, news_summary_df, looks like. At this point, we have the original article and its summary in the same DataFrame; they are just different columns of the same row.

Let's take a look at a sample article from this DataFrame. I've picked the article at index 10. Here is the original article; it's something about the actress Julia Roberts. You can see there are quotes, backslashes, and a whole bunch of other characters that we do not want, so this text requires some cleaning. Let's take a look at the summary for this same article, the one at index 10. You can see the summary here; once again, it also has some additional characters that we need to clean.

Before we perform any data cleaning, let's convert the data that we have in a DataFrame into a Hugging Face dataset object. You can do this very easily by invoking the Dataset.from_pandas function, which you can see on line 3. Running this code cell gives us a dataset. Notice the features: category, ID, article, and summary. And there are a total of 2,225 rows in this dataset.

Next, I've defined a clean_text function. This function is exactly the same as the one we saw in the previous demo: I convert the text to lowercase and replace backslashes, forward slashes, newline characters, and quotes, all with the empty string. For each record, we do this cleaning for both the article and the summary. Applying the clean_text operation simply involves invoking the map function on our dataset and passing in the clean_text function.

Let's confirm that our data is indeed clean. Here is a sample from the original dataset at index 0; you can see it contains a lot of newlines, quotes, backslashes, and so on. Now take a look at the cleaned version of the same article, which lives in the cleaned_news_summary variable at index 0. Everything is in lowercase, with no newlines and no extra quote characters.
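
The transcript refers to the notebook's line numbers, but the code itself isn't reproduced on this page. Here is a minimal sketch of what the extract function might look like, assuming files live in paths like .../business/001.txt; the exact regular expression and folder layout in the course notebook may differ.

```python
import re

def extract(file_path):
    # Assumed layout: .../<category>/<id>.txt, e.g. .../business/001.txt
    pattern = r"([a-z]+)/(\d+)\.txt$"
    match = re.search(pattern, file_path)
    category, file_id = match.group(1), match.group(2)
    # Read the file's contents; the encoding/error handling is an assumption
    with open(file_path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    return category, file_id, text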
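Building on that, a sketch of the DataFrame assembly and merge described above. The subfolder names News Articles and Summaries and the use of pathlib are assumptions; dataset_path, articles_data, summaries_data, the column names, and news_summary_df come from the transcript.

```python
from pathlib import Path
import pandas as pd

dataset_path = Path("BBC News Summary")  # root folder named in the video

# Extract a (category, id, text) tuple from every article and summary file
articles_data = [extract(p.as_posix())
                 for p in sorted((dataset_path / "News Articles").rglob("*.txt"))]
summaries_data = [extract(p.as_posix())
                  for p in sorted((dataset_path / "Summaries").rglob("*.txt"))]

# Convert the lists of tuples to DataFrames with named columns
articles_df = pd.DataFrame(articles_data, columns=["category", "id", "article"])
summaries_df = pd.DataFrame(summaries_data, columns=["category", "id", "summary"])

# Join on category and id so each row pairs an article with its summary
news_summary_df = pd.merge(articles_df, summaries_df, on=["category", "id"])
```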
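Converting the merged DataFrame to a Hugging Face dataset object is then a single call to Dataset.from_pandas, as the transcript describes; the variable name news_summary_dataset is an assumption.

```python
from datasets import Dataset

news_summary_dataset = Dataset.from_pandas(news_summary_df)
print(news_summary_dataset)
# Expected to print something along the lines of:
# Dataset({
#     features: ['category', 'id', 'article', 'summary'],
#     num_rows: 2225
# })
```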
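Finally, a sketch of the clean_text function and the map call, assuming the replacement set the transcript names (backslashes, forward slashes, newlines, and quotes); the course notebook's exact character list may differ.

```python
def clean_text(record):
    # Clean both the article and the summary columns of each record
    for column in ("article", "summary"):
        text = record[column].lower()
        for unwanted in ("\\", "/", "\n", '"', "'"):
            text = text.replace(unwanted, "")
        record[column] = text
    return record

cleaned_news_summary = news_summary_dataset.map(clean_text)

# Compare a raw sample with its cleaned counterpart
print(news_summary_dataset[0]["article"][:200])
print(cleaned_news_summary[0]["article"][:200])
```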
