From the course: AI Text Summarization with Hugging Face

Evaluating summaries using ROUGE scores

By looking at a summary, we can judge whether it seems good or not, but that's a very subjective evaluation. For a more objective measure of whether a particular summarizer is generating good summaries, we need to evaluate its results using the ROUGE score. The evaluate library from Hugging Face is, as its name suggests, a library for easily evaluating machine learning models and datasets. With a single line of code, it gives us access to dozens of evaluation metrics across different domains, whether NLP, computer vision, or anything else. Here we are interested in the ROUGE metrics for evaluating text summaries. I call evaluate.load and pass in rouge as the evaluation metric I'm interested in, and I store the result in a variable called rouge. If you look at the output generated when the ROUGE metric is loaded, you can see what this evaluation metric returns: rouge1, rouge2, rougeL, and rougeLsum scores for whatever data you pass in. RougeLsum is a variant of the rougeL metric that we discussed earlier: the longest common subsequence is computed between each pair of reference and candidate sentences, and a measure called union LCS is computed from them. This union LCS is the rougeLsum. You can see the code that we'll use to compute ROUGE scores in the examples section.

A ROUGE score is in the range 0 to 1. A score close to 0 indicates poor similarity between candidate and reference, and a score close to 1 indicates strong similarity. With this in mind, let's see an example of how the ROUGE metric is computed. Here is the reference sentence on line 1, "the elephant was found near the river and everyone was glad." Then we have the first example, which has a lot of extra words such as turbulent and ecstatic. Example 2 is another candidate sentence with fewer extra words. We'll now compute the ROUGE scores between the two examples and the reference sentence, and you can see what the results look like.

Let's first compute the ROUGE score between the first example, with its many extra words, and the reference sentence. Here, the reference sentence and the candidate are very similar, even though the example, that is, the candidate sentence, contains many extra words. So you can see the ROUGE scores are rather high: 0.83 for rouge1, 0.63 for rouge2, and 0.83 for both rougeL and rougeLsum. Next, let's compute the ROUGE score for the second candidate and the reference sentence. Example 2 contains fewer extra words. Here, you can see that the rouge1 score is higher, almost 0.87 (0.869). Rouge2 is a little lower at 0.57. RougeL and rougeLsum are also higher at almost 0.87.

Now that you know how ROUGE scores work, let's take a look at the summary that was generated for the article about cloning the Labrador retriever. So this is the summary. Let's compute the ROUGE score of this summary against the original reference text, that is, the summary that is present in the dataset. Setting use_stemmer=True means that a Porter stemmer will be applied to the candidate and reference sentences before the ROUGE scores are computed. The Porter stemmer strips suffixes from words, so that words such as talk and talking are treated as the same word. The ROUGE score for our summary compared to the reference summary is only about 0.13, so it isn't a great score.
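To make the first steps above concrete, here is a minimal sketch that loads the ROUGE metric with the evaluate library and compares two candidate sentences against the reference sentence. The candidate strings are illustrative stand-ins (the exact example sentences from the notebook aren't reproduced in this transcript), so the scores you get will differ slightly from the numbers quoted above.

```python
import evaluate

# Load the ROUGE metric; compute() will report rouge1, rouge2, rougeL, and rougeLsum.
rouge = evaluate.load("rouge")

# Reference sentence from the example above.
reference = "the elephant was found near the river and everyone was glad"

# Illustrative candidates (assumed wording): the first adds several extra
# words such as "turbulent" and "ecstatic", the second adds fewer.
example_1 = "the elephant was found near the turbulent river and everyone was ecstatic and glad"
example_2 = "the elephant was found near the river and everyone was very glad"

# ROUGE for the first candidate against the reference.
scores_1 = rouge.compute(predictions=[example_1], references=[reference])
print(scores_1)

# ROUGE for the second candidate, which has fewer extra words.
scores_2 = rouge.compute(predictions=[example_2], references=[reference])
print(scores_2)
```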
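And here is a similar sketch for scoring the generated summary against the reference summary from the dataset, with the Porter stemmer enabled. The variables generated_summary and reference_summary are placeholders for whatever your summarization pipeline and dataset actually provide.

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholder strings: in the course these come from the summarization
# pipeline's output and the reference summary stored in the dataset.
generated_summary = "..."   # summary produced by the model
reference_summary = "..."   # reference summary from the dataset

# use_stemmer=True applies a Porter stemmer to both texts before scoring,
# so words like "talk" and "talking" count as the same word.
results = rouge.compute(
    predictions=[generated_summary],
    references=[reference_summary],
    use_stemmer=True,
)
print(results)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```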
