Text summarization is a natural language processing (NLP) task that allows users to summarize large amounts of text for quick consumption without losing any important information.
We’ve all come across articles and other long-form texts padded with content that distracts us from the subject matter.
This can get frustrating, especially during research or when trying to collect reliable information. The solution? Text summarization.
With this in mind, let’s first look at the two distinctive methods of text summarization, followed by five techniques that Python developers can use.
‘Extractive’ and ‘Abstractive’ are the two methods of performing text summarization. Let’s discuss them in detail.
As the name suggests, extractive text summarization ‘extracts’ the most notable information from large dumps of text and groups it into a clear and concise summary.
The method is straightforward: it selects sentences directly from the text to be summarized, based on parameters such as how many top-ranked sentences to keep (the top K) and how much each of those sentences contributes to the overall subject.
This, however, also means that the method is limited to predetermined parameters that can make extracted text biased under certain conditions.
Owing to its simplicity in most use cases, extractive text summarization is the most common method used by automatic text summarizers.
Abstractive text summarization generates new, legible sentences from the entirety of the text provided. It rewrites large amounts of text by building an internal representation of it, which is then processed and condensed into a summary using natural language processing.
What makes this method unique is its reliance on a model’s semantic understanding of the text, smoothing out the rough edges of the source material rather than copying sentences verbatim.
Although it is not as simple to use as the extractive method, abstractive summarization is far more useful in many situations. In a lot of ways, it is a precursor to full-fledged AI writing tools. That does not mean, however, that there is no need for extractive summarization.
Here are five approaches to text summarization using both abstractive and extractive methods.
Gensim is an open-source topic and vector space modeling toolkit for the Python programming language.
First, you need the summarize function from Gensim’s summarization.summarizer module, which is based on a variation of the TextRank algorithm. (Note that this module ships with Gensim 3.x; it was removed in Gensim 4.0.)
Since TextRank is a graph-based ranking algorithm, it ranks the importance of vertices in a graph using global information drawn from the graph as a whole, rather than from any single vertex.
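To make that concrete, here is a deliberately simplified sketch of the idea (an illustration only, not Gensim’s actual implementation): each sentence becomes a vertex, edges are weighted by word overlap, and scores are propagated around the graph until the most central sentences rise to the top.

def simple_textrank(sentences, iterations=20, damping=0.85):
    # Edge weight: normalized word overlap between two sentences
    def overlap(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / (1 + len(wa | wb))

    n = len(sentences)
    weights = [[overlap(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = [1.0] * n

    # Propagate scores around the graph (PageRank-style power iteration)
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_sum = sum(weights[j]) or 1.0
                rank += weights[j][i] * scores[j] / out_sum
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores

    # Sentences sorted from most to least central
    return [s for _, s in sorted(zip(scores, sentences), reverse=True)]

Gensim’s summarizer works on the same principle, just with a more sophisticated sentence-similarity function.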
Here’s an example code to summarize text from Wikipedia:
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
import wikipedia
import en_core_web_sm
To fetch the Wikipedia content:
wikisearch = wikipedia.page("")  # replace "" with the title of the Wikipedia page to summarize
wikicontent = wikisearch.content
nlp = en_core_web_sm.load()
doc = nlp(wikicontent)
To summarize based on percentage:
summ_per = summarize(wikicontent, ratio="")  # replace "" with a float such as 0.01 (fraction of text to keep)
print("Percent summary")
print(summ_per)
To summarize based on word count:
summ_words = summarize(wikicontent, word_count="")  # replace "" with an integer such as 200
print("Word count summary")
print(summ_words)
There are two ways of extracting text using TextRank: keyword and sentence extraction.
Keyword extraction can be done with a simple frequency count, but that often proves inaccurate. This is where TextRank automates the process, using the semantics of the corpus to provide far more accurate results.
Sentence extraction, on the other hand, studies the corpus to identify the sentences most relevant to the subject matter and arranges them into a summary.
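As a quick illustration of keyword extraction, the keywords function imported earlier can be applied to the same Wikipedia content (a minimal sketch; the words value is just an example):

kw = keywords(wikicontent, words=10, lemmatize=True)
print("Keywords:")
print(kw)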
Sumy is another library in Python that uses various algorithms to perform text summarization.
Let’s take a look at a few of them.
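Each of the snippets below assumes the text has already been parsed into a Sumy document. A minimal setup sketch, assuming your input string is stored in a variable named text:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# Build a parsed document that every Sumy summarizer can consume
parser = PlaintextParser.from_string(text, Tokenizer("english"))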
LexRank is a graph-based summarizer. The code is as follows:
from sumy.summarizers.lex_rank import LexRankSummarizer

summarizer_lex = LexRankSummarizer()

# Summarize using sumy LexRank
summary = summarizer_lex(parser.document, 2)
lex_summary = ""
for sentence in summary:
    lex_summary += str(sentence)
print(lex_summary)
Developed by the IBM researcher Hans Peter Luhn, after whom it is named, Luhn is one of the oldest summarization algorithms; it ranks sentences based on a frequency criterion for words.
Here’s the code for the algorithm:
from sumy.summarizers.luhn import LuhnSummarizer

summarizer_1 = LuhnSummarizer()
summary_1 = summarizer_1(parser.document, 2)
for sentence in summary_1:
    print(sentence)
Latent semantic analysis is an automated method of summarization that utilizes term frequency with singular value decomposition. It has become one of the most used summarizers in recent years.
The code is as follows:
from sumy.summarizers.lsa import LsaSummarizer

summarizer_lsa = LsaSummarizer()

# Summarize using sumy LSA
summary = summarizer_lsa(parser.document, 2)
lsa_summary = ""
for sentence in summary:
    lsa_summary += str(sentence)
print(lsa_summary)
And last but not least, there is TextRank, which works in much the same way here as it does in Gensim.
Here’s the code for this algorithm:
# Load Packages
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# For Strings
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Summarize using sumy TextRank
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, 2)
text_summary = ""
for sentence in summary:
    text_summary += str(sentence)
print(text_summary)
When using each of these summarizers, you will notice that they summarize text differently. It’s better to try them all to figure out which one works best in different situations.
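As a rough way to compare them side by side, you can run every summarizer over the same parsed document and print each result in turn (a small sketch reusing the parser object from above):

from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer

# Run each summarizer over the same document and print two sentences from each
summarizers = {
    "LexRank": LexRankSummarizer(),
    "Luhn": LuhnSummarizer(),
    "LSA": LsaSummarizer(),
    "TextRank": TextRankSummarizer(),
}

for name, summarizer in summarizers.items():
    print(f"--- {name} ---")
    for sentence in summarizer(parser.document, 2):
        print(sentence)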
The ‘Natural Language Toolkit’ (NLTK) is a general-purpose NLP library for Python that can also be used to build a simple extractive summarizer.
Here’s how to get it up and running.
Import the required libraries using the code below:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# If you have not used NLTK before, you may also need to download its data:
# nltk.download('punkt')
# nltk.download('stopwords')
Input your text for summarizing below:
text = """ """
Next, you need to tokenize the text:
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
Now, you will need to create a frequency table to keep a score of each word:
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
Next, create a dictionary to keep the score of each sentence:
sentences = sent_tokenize(text)
sentenceValue = dict()

for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
Now, we define the average sentence score from the original text as follows:
average = int(sumValues / len(sentenceValue))
And lastly, we need to store the sentences into our summary:
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
print(summary)
To make use of Google’s T5 summarizer, there are a few prerequisites.
First, you will need to install PyTorch and Hugging Face’s Transformers. You can install the transformers using the code below:
pip install transformers
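PyTorch can usually be installed the same way; the exact command depends on your platform and on whether you need GPU support, but a typical CPU-only install is:

pip install torch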
Next, import PyTorch along with the AutoTokenizer and AutoModelWithLMHead objects (note that newer Transformers releases deprecate this class in favor of AutoModelForSeq2SeqLM):
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead
Next, you need to initialize the tokenizer model:
tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelWithLMHead.from_pretrained('t5-base', return_dict=True)
From here, you can use any data you like to summarize. Once you have gathered your data, input the code below to tokenize it:
inputs = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)
Now, you can generate the summary by using the model.generate function on T5:
summary_ids = model.generate(inputs, max_length=150, min_length=80, length_penalty=5., num_beams=2)
Feel free to replace the values mentioned above with your desired values. Once it’s ready, you can move on to decode the tokenized summary using the tokenizer.decode function:
summary = tokenizer.decode(summary_ids[0])
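If the decoded string still contains special tokens such as <pad> or </s>, a small tweak is to pass skip_special_tokens=True and print the result:

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)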
And there you have it: a text summarizer with Google’s T5. You can replace the texts and values at any time to summarize various arrays of data.
GPT-3 is the successor to GPT-2 and is considerably more capable; it is accessed through the OpenAI API. Let’s take a look at how to use it from Python, with an example that downloads and summarizes PDF research papers.
First, you will need to import all dependencies as listed below:
import openai
import wget
import pathlib
import pdfplumber
import numpy as np
You will then need to install openai to interact with GPT-3, so make sure you have an API key; you can get one from the OpenAI website.
You will also need wget to download PDFs from the internet and pdfplumber to convert them back to text. Install all three with pip:
pip install openai
pip install wget
pip install pdfplumber
To download the PDF and return its local path, enter the following:
def getPaper(paper_url, filename="random_paper.pdf"):
    """
    Downloads a paper from the given url and returns the local path to that file.
    """
    downloadedPaper = wget.download(paper_url, filename)
    downloadedPaperFilePath = pathlib.Path(downloadedPaper)
    return downloadedPaperFilePath
Now, you need to convert the PDF into text so GPT-3 can read it:
paperFilePath = "random_paper.pdf"
paperContent = pdfplumber.open(paperFilePath).pages

def displayPaperContent(paperContent, page_start=0, page_end=5):
    for page in paperContent[page_start:page_end]:
        print(page.extract_text())

displayPaperContent(paperContent)
Now that you have the text, it’s time to start summarizing it:
def showPaperSummary(paperContent):
    tldr_tag = "\n tl;dr:"
    openai.organization = 'organization key'
    openai.api_key = "your api key"
    engine_list = openai.Engine.list()  # list of engines available from the openai API
Here, the tl;dr: tag is how we let the GPT-3 model know that we require a summary. We then set up the credentials needed to use the openai API.
    # Continuing the showPaperSummary function defined above
    for page in paperContent:
        text = page.extract_text() + tldr_tag
        response = openai.Completion.create(
            engine="davinci",
            prompt=text,
            temperature=0.3,
            max_tokens=140,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n"]
        )
        print(response["choices"][0]["text"])
This code extracts the text from each page, appends the tl;dr tag, sends it to the GPT-3 model (with max_tokens capping the length of each completion), and prints the resulting summary to the terminal.
Now that everything is set up, we can run the summarizer:
paperContent = pdfplumber.open(paperFilePath).pages
showPaperSummary(paperContent)
Text summarization is very useful for anyone dealing with large amounts of written data on a daily basis, whether that’s writers at online magazines, researchers combing through sources, or teachers in schools.
While Python offers simple approaches to text summarization such as Gensim and Sumy, there are also far more powerful, if slightly more complicated, summarizers such as T5 and GPT-3.
Which technique to choose really comes down to preference and the use case for each of these summarizers. In the long run, though, model-based summarizers such as T5 and GPT-3 are likely to produce the better results, as the underlying models keep improving.