Extractive Summarization

Dev Aggarwal
6 min readJul 2, 2023

--

Have you ever been annoyed about having to read a super long document? Or maybe you forget what you were reading about as you pore through some mind-numbing material. Or maybe you read a super long document only to realize that it doesn’t even contain what you’re looking for.

Source: imgflip

That’s where summaries come up into play! They improve efficiency by cutting down time and letting you know what you’re about to read beforehand. But unfortunately, not every document / article contains a summary to go along with it.

Automative Text Summarization

Designed to capture the essence of a document, and express it in a shorter amount of words. There are many different algorithms that fall under the umbrella technique of summarization.

They are classified as either:

  1. Extractive summarization:
    Determines the sentences most important to a text document and groups them together. The sentences in the target summary are identical to that in the original text document, but the method of analysis determines the degree to which a sentence is recognized as important or essential to the text document.
    Ex: Frequency
  2. Abstractive summarization:
    Uses deep learning to generate a summary of the document that does not contain identical sentences. The newly generated summary may contain new terms, phrases, sentences, etc. but the main points are now expressed in a concise manner.

Extractive Summarization Technique — TextRank Algorithm

There are different implementations of the TextRank algorithm depending on which informational unit is chosen. Text is broken up into units such as either words, phrases, sentences, paragraphs, etc. Analysis of the informational units in comparison to one another allows for the relationship between sentences to quantitatively measured. This is important because quantitative measurements allow for a graph-based ranking model to be formed. The nodes are the sentences themselves while the weights associated with each edge are the degree of similarity to other nodes.

Thus, if we take the size of an information unit to be a sentence in this instance, then the first step is to create a matrix that records the similarity of every sentence in relation to all other sentences as seen below. There are different ways of measuring similarity. The method below determines the overlap between 2 sentences by dividing the total amount of shared tokens by the logarithm of the 1st sentence’s length + the logarithm of the 2nd sentence’s length to avoid promoting long sentences.

# data variable simulates large amount of text
data = 'Quantum computing is a revolutionary paradigm that utilizes the principles of quantum mechanics to process and manipulate information. At its core, quantum computing harnesses the unique properties of quantum bits, or qubits, which can exist in a superposition of both 0 and 1 states simultaneously. This allows quantum computers to perform computations in parallel, potentially providing exponential speedups for certain types of problems compared to classical computers. One of the key concepts in quantum computing is entanglement. When qubits become entangled, their states become correlated in such a way that the state of one qubit cannot be described independently of the others. This property enables quantum computers to manipulate and process information in a highly interconnected manner, leading to the potential for solving complex problems more efficiently. However, quantum computing faces significant challenges. Quantum systems are susceptible to errors caused by decoherence, which is the loss of quantum coherence due to environmental interactions. Ensuring the reliability and stability of qubits is a major technical hurdle. Additionally, building large-scale quantum computers remains a significant engineering feat due to the delicate nature of qubits and the need for precise control over quantum states. Despite these challenges, quantum computing holds immense promise. It has the potential to revolutionize fields such as cryptography, optimization, simulation, and machine learning, tackling problems that are currently intractable for classical computers. Ongoing research and development in quantum computing are focused on improving qubit stability, developing error-correction techniques, and exploring new technologies to build more powerful and scalable quantum systems.'
if(data[-1]=='.'):
data = data[:-1]
sentences = data.split('.')
similarityScore = np.zeros((len(sentences), len(sentences)))

tokenizer = RegexpTokenizer(r'\w+')

for s1 in range(0,len(sentences)-1):
for s2 in range(s1+1, len(sentences)):
if(s1 == s2):
continue
sentence1 = tokenizer.tokenize(sentences[s1])
sentence2 = tokenizer.tokenize(sentences[s2])
count = 0
for x in range(len(sentence1)):
for y in range(x, len(sentence2)):
if(sentence1[x]==sentence2[y]):
count+=1
similarityScore[s1, s2] = count/(math.log(len(sentence1))+math.log(len(sentence2)))
similarityScore[s2, s1] = similarityScore[s1, s2]

# NOTE: The matrix is symmetrical across the diagonal

The next step is to feed the similarityScore matrix into the PageRank model.

PageRank Algorithm

The PageRank algorithm developed by Google is a way of assessing the quality of a website. The PageRank algorithm was in use by Google until 2014 in order to rank the results of search queries and was responsible for organizing the search results page, the first of which is particularly important.

The PageRank algorithm operated under 1 fundamental assumption:

The best quality sites have the most widespread links in a network

If a website contains a link to another site, then the 2nd website has been evaluated and recommended by the 1st website. Therefore, the website whose URL is most referred by other sites, has the greatest exposure in a network and is most likely to be the best quality website.

Source: COMP 130

We will first test the effects of the PageRank algorithm from the NetworkX library and later compare it a personally coded version of the Page Rank algorithm.

nx_graph = networkx.from_numpy_array(similarityScore)
results = networkx.pagerank(nx_graph)
ranked_sentences = sorted(((results[i],s) for i,s in enumerate(sentences)), reverse=True)

for i in range(5):
print(ranked_sentences[i][1], end=". ")

-----------------------------------OUTPUT-------------------------------------
# Additionally, building large-scale quantum computers remains a significant
# engineering feat due to the delicate nature of qubits and the need for
# precise control over quantum states. Ongoing research and development in
# quantum computing are focused on improving qubit stability, developing
# error-correction techniques, and exploring new technologies to build more
# powerful and scalable quantum systems. Quantum computing is a revolutionary
# paradigm that utilizes the principles of quantum mechanics to process and
# manipulate information. This property enables quantum computers to
# manipulate and process information in a highly interconnected manner,
# leading to the potential for solving complex problems more efficiently.
# Quantum systems are susceptible to errors caused by decoherence, which is
# the loss of quantum coherence due to environmental interactions.

This summary from the official PageRank algorithm of the NetworkX library surprisingly seems to be a mess. The method for measuring the weights is basic at best, and can certainly be optimized. For example, stop words such as “the” should definitely be removed because they add to the similarity score without possessing any importance, with an accurate measure of the latter attribute (importance) being the key to a successful automatic summarization technique.

A simple implementation of the PageRank algorithm is below, with code taken from YouTube. Read more about the PageRank algorithm from the sources at the bottom.

def pagerank(trans_mat, eps=0.0001, d=0.85):
prob = np.ones(len(trans_mat)) / len(trans_mat)
count = 0
while count < 20:
new_prob = np.ones(len(trans_mat)) * (1-d) / len(trans_mat) + d*trans_mat.T.dot(prob)
delta = abs(new_prob-prob).sum()
if delta <= eps:
return new_prob
prob = new_prob
count +=1
return prob

results = pagerank(similarityScore)
sorted_results = np.sort(results)
sorted_results = sorted_results[::-1]
print("\n")
for x in range(len(results)):
if results[x]==sorted_results[0]:
print(sentences[x], end=". ")
elif results[x]==sorted_results[1]:
print(sentences[x], end=". ")
elif results[x]==sorted_results[2]:
print(sentences[x], end=". ")
elif results[x]==sorted_results[3]:
print(sentences[x], end=". ")
elif results[x]==sorted_results[4]:
print(sentences[x], end=". ")

-----------------------------------OUTPUT-------------------------------------
# Quantum computing is a revolutionary paradigm that utilizes the principles of
# quantum mechanics to process and manipulate information. This property
# enables quantum computers to manipulate and process information in a highly
# interconnected manner, leading to the potential for solving complex problems
# more efficiently. Quantum systems are susceptible to errors caused by
# decoherence, which is the loss of quantum coherence due to environmental
# interactions. Additionally, building large-scale quantum computers remains
# a significant engineering feat due to the delicate nature of qubits and the
# need for precise control over quantum states. Ongoing research and
# development in quantum computing are focused on improving qubit stability,
# developing error-correction techniques, and exploring new technologies to
# build more powerful and scalable quantum systems.

Wow! This custom PageRank algorithm seems to have produced a much better summary than the results seen from NetworkX’s PageRank algorithm. The extent to its success in relation to the previous PageRank algorithm is shocking. Unfortunately, the why behind this custom-built PageRank algorithm show much better results is a mystery, but it’s certainly something interesting to look into.

Sources:

https://www.youtube.com/watch?v=qtLk2x59Va8

https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

https://cran.r-project.org/web/packages/textrank/vignettes/textrank.html#:~:text=The%20textrank%20algorithm%20(keyword%20extraction,broad%20range%20of%20R%20packages.

--

--

Dev Aggarwal
Dev Aggarwal

Written by Dev Aggarwal

Tennis player, bookworm, programmer that can't wait to learn more, do more

No responses yet