From the course: AI Text Summarization with Hugging Face
Generating multiple summaries and computing aggregate ROUGE scores - Hugging Face Tutorial
Now that we've generated a candidate summary for a single example text and computed the ROUGE score for that summary, let's do it for a set of articles. We'll generate summaries for 50 articles and compute the aggregate ROUGE scores across all 50 summaries.

In the active code cell, I extract article_text and article_summaries into two different variables. Once again, we'll use the tqdm (taqaddum) library to show a progress bar. I use a for loop to iterate over the first 50 articles. On line 6, I invoke the summarizer on the article. Once again, no prefix text is required because this model is primarily a summarizer; it doesn't perform other tasks. I append every summary to the candidate_summaries list. I'm going to speed up this generation of summaries, but it took about 30 minutes when I ran it on Colab on a GPU.

Now that we have the summaries, let's compute the aggregate ROUGE scores across these 50 summaries. Pass in the candidate summaries from the model and the reference summaries from our dataset. I use a stemmer to cut off word suffixes, so words such as "expand" and "expanding" will basically be treated as the same token. And here are the aggregate ROUGE scores: rouge1 is 0.33, rouge2 is 0.23, and rougeL and rougeLsum are both 0.24.

Let's find the best and worst summaries across these 50 articles. Call rouge.compute and specify use_aggregator = False so that we get the unaggregated ROUGE scores, one per summary. We'll evaluate the candidate summaries using the rougeLsum metric and get the index positions of the best and worst summaries generated by our model. The best summary is at index 35, the worst at index 22. Let's set up the candidate summaries from the model and the reference summaries from our dataset in a single DataFrame, with the columns predicted_summaries and reference_summaries.
We can now use this DataFrame to access the candidate summaries and the reference summaries at the two indices, one for the best summary and one for the worst. At index 35, we have the summary with the best rougeLsum score. Look at the candidate summary alongside the reference summary and you can see that there are lots of overlapping words; the summary is clearly a good one. The worst summary is at index 22. Let's take a look and compare them for ourselves. It's about Mary Poppins, and you can see that there are very few overlapping words and the candidate summary is also very short. Clearly, this summary did not get a great ROUGE score.
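A minimal sketch of the DataFrame lookup, with three toy rows standing in for the 50 real summary pairs (the column names match the transcript; the index values here are illustrative):

```python
import pandas as pd

# Toy stand-ins for the model outputs and the dataset references
summary_df = pd.DataFrame({
    "predicted_summaries": ["summary a", "summary b", "summary c"],
    "reference_summaries": ["reference a", "reference b", "reference c"],
})

best_idx, worst_idx = 2, 0  # in the video these were 35 and 22

# Positional lookup of the best and worst candidate/reference pairs
best_row = summary_df.iloc[best_idx]
worst_row = summary_df.iloc[worst_idx]
print(best_row["predicted_summaries"], "|", best_row["reference_summaries"])
print(worst_row["predicted_summaries"], "|", worst_row["reference_summaries"])
```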
Contents
- Accessing the BBC dataset on Google Drive (3m 34s)
- Instantiating and cleaning the BBC News summaries dataset (3m 48s)
- Generating summaries using Pegasus (4m 55s)
- Generating multiple summaries and computing aggregate ROUGE scores (2m 49s)
- Generating summaries using BART (3m 19s)
- Computing ROUGE metrics for a set of summaries (2m 9s)