Summarizing text and computing aggregate ROUGE scores

From the course: AI Text Summarization with Hugging Face

Instead of generating summaries one article at a time, I'm going to take a small set of articles from the test data that we carved out of our original dataset and generate summaries for 50 or so articles. We'll then compute the ROUGE scores for all 50 summaries and see what they look like. In the active code cell, I've accessed the text and the summaries of the articles in our test dataset.

Next, I run a little for loop to generate summaries for 50 samples from our test data. I've chosen only 50 samples because it takes a long time for the summarizer to generate summaries: running on a single GPU, each summary takes about 30 seconds, so 50 summaries will take about 20 to 25 minutes. But first, we import the tqdm Python library. tqdm is short for taqaddum, Arabic for "progress," and it shows us a progress bar as we iterate over the for loop. On line 7, I run a for loop over the first 50 articles in our test dataset. For each article, we generate a candidate summary by passing in the prefix plus the text. Remember, we have to pass in the "summarize: " prefix for this model. We append every candidate summary that is generated to the candidate summaries list. (Rough code sketches of these steps follow at the end of this walkthrough.)

Let's go ahead and run this code. It will take about 20 minutes to run through, so if you're running the code along with me, you'll have to be a little patient and wait for all of the summarization to complete. There's a little warning here that you can safely ignore: by default, the max length of our summaries is 200 tokens, but there is one input article that is just 171 tokens long, and that's what the warning is about.

Now that we have the candidate summaries and the original article summaries, let's use rouge.compute to compute the ROUGE scores. This gives us the aggregate ROUGE scores across all 50 generated summaries. You can see rouge1 is a little better here at 0.32. rouge2 is 0.139, which is not great. rougeL and rougeLsum are both around 0.24. Again, these are improved from what we had previously with just one candidate summary.

You can also get unaggregated ROUGE scores, so that you get a ROUGE score for each individual candidate summary versus its reference. The main change here is that in rouge.compute, I've passed in use_aggregator=False. Let's take a look at the unaggregated results, and you can see that I get ROUGE scores for each individual candidate summary.

Let's say the metric you want to use to evaluate your candidate summaries is rouge2, which considers bigrams when checking whether candidate summaries are similar to reference summaries. I've used np.argmax and np.argmin to find the candidate summaries with the best and the worst rouge2 scores. Let's take a look at what these indices are: 38 has the best rouge2 score and 12 has the worst. These are both indices into our 50 summaries.

What I'll do next is set up a DataFrame with two columns: predicted summaries from our T5 model and reference summaries from our test data. The DataFrame is called act_vs_pred_summaries_df, and calling .head() on it shows the first few rows. Now that we have this in a DataFrame, let's look at the candidate summary with the best rouge2 score. Remember, this is at index 38, and we'll compare it with the reference summary. Looking at the summary generated by our model alongside the reference summary, you can see that there are many similarities. This is clearly a good summary that our summarizer has produced. Let's now look at index 12, which had the worst rouge2 score. Even just glancing at these two summaries, you can see that the actual summary looks very different from the reference summary, which is why it received a low rouge2 score.
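To make the steps above concrete, here's a minimal sketch of the generation loop. I'm assuming `summarizer` is the Hugging Face summarization pipeline built on a T5 checkpoint earlier in the course, and `test_data` is our held-out test split with "article" and "highlights" columns; those names are illustrative, so adjust them to match your notebook.

```python
from tqdm import tqdm

# Illustrative names: `summarizer` is assumed to be the Hugging Face
# summarization pipeline built earlier on a T5 checkpoint, and
# `test_data` the held-out test split with "article" and "highlights"
# columns.
prefix = "summarize: "  # T5 expects this task prefix on its inputs

candidate_summaries = []
for article in tqdm(test_data["article"][:50]):
    output = summarizer(prefix + article, max_length=200)
    candidate_summaries.append(output[0]["summary_text"])

reference_summaries = test_data["highlights"][:50]
```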
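Computing the aggregate scores is then a single rouge.compute call, assuming `rouge` is the ROUGE metric loaded from the evaluate library:

```python
import evaluate

# Load the ROUGE metric from the evaluate library.
rouge = evaluate.load("rouge")

# By default, rouge.compute aggregates across all prediction/reference
# pairs and returns a single value per metric.
aggregate_scores = rouge.compute(
    predictions=candidate_summaries,
    references=reference_summaries,
)
print(aggregate_scores)
# e.g. {'rouge1': 0.32, 'rouge2': 0.139, 'rougeL': 0.24, 'rougeLsum': 0.24}
```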
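For the unaggregated, per-summary scores and the best/worst lookup on rouge2, a sketch might look like this:

```python
import numpy as np

# use_aggregator=False returns a list with one score per
# candidate/reference pair instead of one averaged value per metric.
unaggregated_scores = rouge.compute(
    predictions=candidate_summaries,
    references=reference_summaries,
    use_aggregator=False,
)

rouge2_scores = unaggregated_scores["rouge2"]
best_idx = int(np.argmax(rouge2_scores))   # 38 in this run
worst_idx = int(np.argmin(rouge2_scores))  # 12 in this run
```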
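Finally, a sketch of the side-by-side DataFrame comparison. The DataFrame name comes from the video; the column names here are illustrative:

```python
import pandas as pd

# Side-by-side comparison: model output vs. reference from the test data.
act_vs_pred_summaries_df = pd.DataFrame({
    "predicted_summary": candidate_summaries,
    "reference_summary": reference_summaries,
})
act_vs_pred_summaries_df.head()

# Inspect the best- and worst-scoring candidates found above.
print(act_vs_pred_summaries_df.iloc[best_idx])
print(act_vs_pred_summaries_df.iloc[worst_idx])
```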
