From the course: AI Text Summarization with Hugging Face

Intermediate representations for extractive summarization

In this movie, we'll do a quick overview of some of the techniques that can be used to generate the intermediate representation that will then be used for extractive summarization. Representing the original content using an intermediate representation is the first step in the summarization process. Now, the kinds of intermediate representations that can be used to represent content are divided into two broad categories: topic words representation and indicator representation.

The objective of topic words representation is to identify words that describe the topic of the input document. This is a tried and tested technique, one of the original approaches from back in the 1950s. There are several techniques that fall into this broad category: topic words, frequency-based techniques, latent semantic analysis, and Bayesian topic models. I'll explore each of these briefly so you get a big-picture understanding of how they work, starting with topic words.

In this representation, you use mathematical or statistical techniques to identify important words that are present in the input text. You may use frequency thresholds or the log-likelihood ratio test to identify topic signatures. Once you've identified these topics, you then assign an importance score to each sentence, which can be computed in a variety of ways. You may say that the more topic words a sentence contains, the more important it is. This technique will, of course, favor longer sentences, which have more topic coverage. Or you can say sentence importance is a function of the proportion of topic signatures a sentence contains, which favors denser sentences over longer ones.

Another way to identify important topics in the input content is to use frequency-based techniques. There are, of course, different variations of this, but the basic idea is that you assign weights to words in the text based on topic representations, and you can use word probability scores as a measure of word importance. Essentially, you compute how many times a particular word occurs in the input text, that's the frequency of the word f(w), and divide it by the total number of words in the text, N. This gives you the word probability score p(w) = f(w) / N. The more often a word occurs, the more important that word is supposed to be. This, of course, means that you will have to pre-process the input text to get rid of stop words. Stop words are words that add little meaning: the, then, than, and so on. As you might imagine, these stop words occur very frequently in text. Finally, once every sentence has an importance score associated with it, the sentences that you select for your summary may be the ones that contain the highest-probability words.

A variation of the frequency-based intermediate representation is to use TF-IDF scores rather than word probabilities to compute the importance of a sentence. TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a technique that is used to represent words in numeric form. The term frequency will up-weight words which occur very frequently in a document. Let's say the word amazing occurs several times in a document; that word will be up-weighted. The inverse document frequency will down-weight words which occur very frequently across the entire corpus, because these words are more likely to be stop words.
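To make the frequency-based scoring concrete, here is a minimal sketch of word-probability scoring for extractive summarization. This is not the course's code; the function names, the tiny stop-word list, and the regex-based sentence splitting are all illustrative assumptions, and swapping the word probabilities for TF-IDF weights would give you the TF-IDF variation described above.

```python
from collections import Counter
import re

# Illustrative stop-word list; a real pipeline would use a fuller list
# (for example, the one shipped with NLTK or spaCy).
STOP_WORDS = {"the", "a", "an", "and", "or", "then", "than", "of", "to", "in", "is", "it"}

def word_probabilities(text):
    """p(w) = f(w) / N over the non-stop-word tokens in the text."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def sentence_score(sentence, probs):
    """Score a sentence as the average probability of its content words."""
    tokens = [t for t in re.findall(r"[a-z']+", sentence.lower()) if t not in STOP_WORDS]
    if not tokens:
        return 0.0
    return sum(probs.get(t, 0.0) for t in tokens) / len(tokens)

def extractive_summary(text, num_sentences=2):
    """Keep the highest-scoring sentences, preserving their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    probs = word_probabilities(text)
    ranked = sorted(sentences, key=lambda s: sentence_score(s, probs), reverse=True)
    chosen = set(ranked[:num_sentences])
    return " ".join(s for s in sentences if s in chosen)
```

Calling extractive_summary(article_text, num_sentences=3) on a longer article would return the three sentences whose content words have the highest average probability.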
Another technique that's used in topic words representation, and that empirically gives good results, is latent semantic analysis. This is an unsupervised method for extracting a representation of the input text. The technique uses singular value decomposition, a matrix decomposition technique, to determine to what extent a particular sentence represents a topic. Latent semantic analysis thus identifies the latent topics that are present in your input text. You can then choose sentences for the summary representing every topic that was identified.

And the last technique that we'll discuss here under the broad heading of topic words representation is Bayesian topic models. As the name suggests, Bayesian topic models are probabilistic models that help uncover and represent topics embedded in documents. Summarizers that are built using Bayesian topic models determine the similarities and differences between documents, and they score sentences using a measure known as the KL measure, where KL stands for Kullback-Leibler. The KL measure captures the difference, or divergence, between two probability distributions, say P and Q. This is an interesting method for scoring sentences for summarization because it builds on the intuition that good summaries are similar to the input documents.

So far, we've discussed techniques that fall under the broad category of topic words representation. There is also the indicator representation that you can use to represent original content. Here, the input text is modeled in terms of features, and these features are used to rank the sentences in the input text. These features can be anything that conveys the importance of a sentence: sentence length, position in the document, whether a sentence contains certain phrases, and so on. Again, there are many different techniques you can use to get an indicator representation for your input text, but we'll discuss two briefly: graph methods and machine learning techniques.

Let's discuss graph methods first. These techniques represent documents as a connected graph, and they are heavily influenced by the PageRank algorithm that Google Search uses. Unlike PageRank, which uses links to determine the importance of a particular page, here two sentences in your input text are said to be connected if the similarity between them is greater than a certain threshold. These models analyze the entire input text and try to find subgraphs that exist within the text, and these subgraphs represent topics. Sentences that are connected to many other sentences in the input text are considered important and should be included in the summary. You'll find a minimal code sketch of this approach at the end of this section.

And finally, we have indicator representations that are generated using machine learning techniques. Here, the summarization task is treated as a classification problem: essentially, you train a model on the original text and have it classify each sentence in that text as a summary sentence or a non-summary sentence. The training data that you'd use here would be a number of documents and the extractive summaries for each of those documents.
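To round this out, here is the graph-method sketch mentioned above. It is not the course's implementation; it simply builds a sentence-similarity graph from TF-IDF vectors and cosine similarity, connects sentences whose similarity exceeds a threshold, and ranks them with PageRank via the networkx library. The function name and the threshold value are illustrative assumptions.

```python
import re

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_based_summary(text, num_sentences=2, threshold=0.1):
    """Rank sentences by PageRank over a sentence-similarity graph."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Represent each sentence as a TF-IDF vector and compute pairwise cosine similarity.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    similarity = cosine_similarity(tfidf)
    # Connect two sentences only when their similarity exceeds the threshold.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if similarity[i, j] > threshold:
                graph.add_edge(i, j, weight=similarity[i, j])
    # Sentences connected to many other sentences receive high scores.
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```

Raising the threshold makes the graph sparser, so only strongly related sentences reinforce each other; lowering it pulls in weaker connections and tends to favor sentences that overlap with many parts of the document.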
