Text Summarizer

Overview

Text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning. Assume there is a document \(D\) that consists of \(n\) sentences: \((S_0, S_1, ..., S_{n-1})\). The problem of text summarization can be formulated as creating another document \(D’\), where \(D’ = (S’_0, S’_1, ..., S’_{m-1})\) and \(m \le n\).

In general, there are two categories of text summarization: extractive summarization and abstractive summarization. For \(i\) in \([0, m)\), if \(S’_i\) is in \({S_0, S_1, ..., S_{n-1}}\), the summarization is extractive summarization. For \(j\) in \([0, m)\), if \(S’_j\) is not in \({S_0, S_1, ..., S_{n-1}}\), the summarization is abstractive summarization.

To create a summarization, the conventional approaches are to first rank the sentences in \(D\). Then the top \(m\) scoring sentences in \(D\) are selected as the summary \(D’\). The key to these approaches is how to define a sorting metric.

Another summarization approach is k-means. Each sentence in \(D\) is represented as a numerical vector, and \(D\) is modeled as \(m\) clusters of numerical vectors. The distance metric can be Euclidean, Cosine, or Manhattan distance. The sentences closest to the \(m\) cluster centroids are chosen as the sentences in \(D’\). The numerical representations of \(D’\) sentences can be generated from sentences’ embeddings, for example, Gensim’s Doc2Vec.

The extractive method is more practical because the summaries it creates are more grammatically correct and semantically relevant to the document. So, the library’s text summarizers take the extractive approach.

The library’s text summarizer implements two versions of extractive summarizers: JaccardSummarizer and KMedoidsSummarizer.

Custom vocabulary

The library’s text summarizer can be customized for specific use-cases. For example, an analyst might want to use distinct summarizers with their own vocabulary.

To achieve this, you can simply specify a custom vocabulary list. During a processing job of the library’s summarizer, only the words in the custom vocabulary are extracted.

This vocabulary customization feature is implemented in JaccardSummarizer. You can specify your own vocabulary when instantiating a Summarizer with JaccardSummarizerConfig.

Jaccard summarizer

This summarizer first preprocesses the document in question to obtain a set of tokens for each sentence in the document. The preprocessing is based on a bag of words model. The document is first segmented into a list of sentences by Natural Language Toolkit’s (NLTK) sent_tokenize method. Then each sentence is further tokenized by NLTK’s regexp_tokenize method, which removes numbers, punctuation, and spaces from the sentence. Next, stop words are removed and stemming is applied to the remaining tokens.

JaccardSummarizer is a traditional summarizer. It scores the sentences in a document measuring similarities. The sentences with higher similarities to other sentences in the documents are ranked higher. The top scoring sentences are selected as the summary of the document.

More specifically, the similarity is calculated in terms of Jaccard similarity. The Jaccard similarity of two sentences A and B is the ratio of the size of intersection of tokens in A and B vs the size of union of tokens in A and B.

\[J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A|+|B|-|A \cap B|}\]

Accordingly, it calculates a symmetric Jaccard similarity matrix for the sentences in the document. Each row and column in the matrix correspond a sentence in the document.

\[\begin{split}J_{similarity} = \left[ \begin{array}{cccc} j_{0,0} & j_{0,1} & ... & j_{0,n-1} \\ j_{1,0} & j_{1,1} & ... & j_{1,n-1} \\ ... & ... & ... & ... \\ j_{n-1,0} & j_{n-1,1} & ... & j_{n-1,n-1} \\ \end{array} \right]\end{split}\]

Finally, the score of a sentence is the row sum of this sentence’s similarities to all other sentences in the document.

\[score_i = \sum_k j_{ik}\]

Then the top \(m\) sentences with the highest similarities are selected via heap sorting or quick select.

To find the API reference for this summarizer, see Summarizer and JaccardSummarizerConfig.

K-medoids summarizer

KMedoidsSummarizer is a k-means based approach. It creates the sentence embeddings using Gensim’s Doc2Vec. Then it uses the k-medoids algorithm to determine the \(m\) sentences in the document closest to the cluster centroids.

To find the API reference for this summarizer, see Summarizer and KMedoidsSummarizerConfig.