Key Takeaways From a Guide to NLP

Dev Aggarwal
13 min read · Oct 30, 2022


Although I have a little bit of knowledge of NLP techniques from reading articles and such, it’s nowhere near as good as the real thing. In this blog post, I will be sharing what I learned as I completed a comprehensive guide (linked at the bottom) to Natural Language Processing. I will go into the coding aspects of various NLP techniques as well as a few practical examples I was guided through, such as text summarization, text generation (autocompletion), and an autonomous question-and-answer system. It is important to note that NLP projects implement multiple techniques cohesively in order to achieve the goal of the program.

Source: Steadforce

The first step is to import libraries in order to run NLP tasks. There are many NLP libraries available for Python, so we will be using the ones below for this blog.

Go to the terminal and run these commands to install the libraries on your computer:

~ pip install nltk
~ pip install spacy
~ pip install transformers

Open a Python file, preferably a Jupyter Notebook in VS Code, and let’s get started with the actual coding.

# nltk - natural language toolkit
import nltk
nltk.download('punkt')  # tokenizer

# spaCy - one of the most popular and advanced NLP libraries
# spaCy is said to be better than nltk and my personal experience
# supports this, but both will be covered for greater insight
import spacy
nlp = spacy.load('en_core_web_sm')

Preprocessing

This is raw data:

raw_text = "Hey! What's up? It's been a while, hasn't it?"

There’s not a whole lot that can be done in this form, but through extensive manipulation, which is taken care of by these libraries and their methods, we can perform meaningful NLP tasks.

The first part of this extensive manipulation is preprocessing, part of which encompasses removing noise from the raw text. Noise is unwanted data that does not contribute to the overall meaning of the sentence: meaningless words and punctuation that only confuse a machine and are in no way essential to the sentence’s meaning. It’s important that noise is removed so that AI programs stay efficient and remain focused on what is important.

We will explore the various methods of accomplishing this goal.

The string raw_text first needs to be converted into an object that can be manipulated, as seen below.

text_doc = nlp(raw_text) # many useful features in text_doc

Tokenization

Tokenization is the process of splitting the sentence up into multiple tokens, or smaller units/parts.

The tokens have already been split apart by the conversion of raw_text into a text_doc object. Print them out with:

for token in text_doc:
    print(token.text)

The tokens in text_doc are labelled in a way that is very useful for gathering information. spaCy records whether each token is punctuation, alphabetical, a stop word, or a space using token.is_punct, token.is_alpha, token.is_stop, and token.is_space, respectively. Stop words are, as previously mentioned, the “meaningless words”.

Find spaCy’s list of stop words:

stopwords = spacy.lang.en.stop_words.STOP_WORDS
list_stopwords = list(stopwords)
# printing a fraction of the list through indexing
for word in list_stopwords[:7]:
    print(word)

Check out the labelling of raw_text

print("Text".ljust(10), ' ', "Alpha", "Space", "Stop", "Punct")
print('-'*35)
token_count = 0
for token in text_doc:
print(token.text.ljust(10), ':',token.is_alpha, token.is_space,
token.is_stop, token.is_punct)
token_count++
print(token_count)

These labels can be used to filter out stop words, which give no information and skew NLP models towards unimportant text.

token_count_without_stopwords_or_punct = 0
filtered_text = [token for token in text_doc
                 if not token.is_stop and not token.is_punct]
for token in filtered_text:
    print(token)
    token_count_without_stopwords_or_punct += 1
print(token_count_without_stopwords_or_punct)

Notice: token_count_without_stopwords_or_punct is less than token_count from the previous section of code

Stemming

Stemming can be used to eliminate redundancy of words with similar forms by tracing words back to their roots.

from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ['dance', 'dances', 'dancing', 'danced']
for word in words:
    print(word.ljust(8), '---------', ps.stem(word))

Lemmatization

Similar to stemming in some ways, but different in others:

  • Considers context and converts the word into a meaningful base form. For example, stemming converts dancing → danc while lemmatization converts dancing → dance. A simple but powerful difference
  • More computationally expensive

nltk.download('wordnet')  # WordNet data required by the lemmatizer
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dances"))

Now we will compare nltk and spaCy

# nltk
print('NLTK:')
words = ['dance', 'dances', 'dancing', 'danced']
lemmatizer = WordNetLemmatizer()
for word in words:
    print(word, '------', lemmatizer.lemmatize(word))

# spaCy
print('spaCy:')
text = 'dance dances dancing danced'
text_doc = nlp(text)
for token in text_doc:
    print(token.text, '------', token.lemma_)
------------------------------RESULTS-------------------------------
NLTK:
dance ------ dance
dances ------ dance
dancing ------ dancing
danced ------ danced
spaCy:
dance ------ dance
dances ------ dance
dancing ------ dance
danced ------ dance

Out of the box, spaCy’s lemmatization clearly performs better than NLTK’s here. To be fair, NLTK’s WordNetLemmatizer treats every word as a noun unless you pass it a part of speech, which is why “dancing” and “danced” were left unchanged, while spaCy infers the part of speech automatically. A more hands-on, independently coded approach to lemmatization could also produce significant results.
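To demonstrate (this little sketch is my own addition, not from the guide): passing the part of speech explicitly to NLTK’s lemmatizer closes most of the gap.

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for word in ['dance', 'dances', 'dancing', 'danced']:
    # pos='v' tells NLTK to treat the word as a verb instead of the default noun
    print(word, '------', lemmatizer.lemmatize(word, pos='v'))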

Word Frequency Analysis

Tokens can also be analyzed for their frequency. Word frequency analysis can come in handy because the frequency of a word can indicate greater meaning or purpose. For humans, identifying the meaning of a text is simple; a computer, however, is not a living, easily understanding organism, which is why we use whatever we can to make sure the computer finds an overall theme or meaning through some kind of algorithmic approach.

from collections import Counter
data = 'It is my birthday today. I could not have a birthday party. I felt sad'
data_doc = nlp(data)
list_of_tokens = [token.text for token in data_doc
                  if not token.is_stop and not token.is_punct]
token_frequency = Counter(list_of_tokens)
print(token_frequency)

token_frequency indicates that ‘birthday’ is the most frequently used word, which is correct. However, it is important to note that this was measured after stop words and other noise were removed. If they were not filtered out, in most cases the word frequency analysis would not yield meaningful results.
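As a quick sanity check (my own addition, not from the guide), counting the unfiltered tokens shows how punctuation and stop words crowd out the meaningful words:

all_tokens = [token.text for token in data_doc]
print(Counter(all_tokens).most_common(3))   # punctuation and pronouns compete with 'birthday'
print(token_frequency.most_common(3))       # after filtering, 'birthday' clearly leads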

Parts of Speech Tagging

Parts of speech tagging is exactly what it sounds like. If you’ve ever encountered extensive grammar lessons and tests in an English class then you might even pity the machine 😅.

For grammar classification of a token, token.pos_ is used:

text = 'My sister was screaming out loud as usual. I kept ignoring her'
text_doc = nlp(text)
for token in text_doc:
    print(token.text.ljust(10), '----', token.pos_)
------------------------------RESULTS-------------------------------
My ---- PRON
sister ---- NOUN
was ---- AUX
screaming ---- VERB
out ---- ADV
loud ---- ADJ
as ---- ADP
usual ---- ADJ
. ---- PUNCT
I ---- PRON
kept ---- VERB
ignoring ---- VERB
her ---- PRON

The results are interesting. I wonder how NLP labels parts of speech. Perhaps it looks at the relationships between words to find each word’s function? Or maybe it cross-references each word against a complete dictionary that has a part-of-speech label for every word. I am willing to bet that an API exists for the latter, although I don’t know if this is the technique used. My doubt stems primarily from text_doc = nlp(text), in that I am astonished that raw data is processed and labelled in so many ways so fast!

I will not show the full code for filtering out certain parts of speech, but all it takes is looping through every token. If word.pos_ in junk_pos evaluates to False, where junk_pos is a list of excluded grammar types, then the word can be added to a new revised_list, as in the sketch below.
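Here is a minimal sketch of that filtering (my own code, not from the guide; junk_pos and revised_list are just the illustrative names used above):

# grammar types to exclude -- an illustrative choice
junk_pos = ['PUNCT', 'DET', 'AUX']
revised_list = [word.text for word in text_doc if word.pos_ not in junk_pos]
print(revised_list)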

A picture is worth a thousand words

And so a visual can be extremely helpful in understanding the relationships between words. For this, the spaCy library ships with a visualizer called displacy.

from spacy import displacy
# set jupyter = False if not using a Jupyter notebook
displacy.render(text_doc, style='dep', jupyter=True)
Source: Me :)
Source: Also me :)

Dependency Parsing

Dependency parsing is similar to part of speech tagging, except that the relationships between words are analyzed. Dependency parsing also identifies an independent word called the “head word”, which other words in the sentence depend upon (see the sketch after the code below).

my_text = 'Ardra fell into a well and fractured her leg'
my_doc = nlp(my_text)
for token in my_doc:
    print(token.text, '---', token.dep_)
# can also use displacy to render a visual
Source: Me
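A small addition of my own: spaCy exposes the head word through token.head, so we can print which word each token depends on.

# token.head is the word this token depends on;
# the sentence root (most likely 'fell' here) points to itself
for token in my_doc:
    print(token.text.ljust(10), '--->', token.head.text, '(', token.dep_, ')')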

Named Entity Recognition

NER identifies named entities such as people, locations, and organizations. It can be used to scan large text documents for entities that could be influential, and it is also used in classification problems and search engines.

sentence = 'The building is located at London. It is the headquarters of Yahoo. John works there. He speaks English'
doc = nlp(sentence)
for entity in doc.ents:  # ents is short for entities
    print(entity.text, '--', entity.label_)
------------------------------RESULTS-------------------------------
London -- GPE
English -- LANGUAGE

GPE stands for geopolitical entity: countries, cities, states, and other political borders/territories. So “London” and “English” are labelled correctly, although I would have expected the model to also identify “John” and “Yahoo”. A little bit disappointing, but I can’t expect perfection in all cases while running methods from libraries. Still, why were “John” and “Yahoo” not recognized?
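One small addition of my own that helps when reading NER output: spacy.explain() turns a label like GPE into a plain-English description. It is also worth noting that en_core_web_sm is spaCy’s smallest English pipeline; its larger siblings (en_core_web_md, en_core_web_lg) often pick up entities like “Yahoo” and “John” that the small model misses, though that is not guaranteed.

for entity in doc.ents:
    # spacy.explain gives a human-readable description of each entity label
    print(entity.text, '--', entity.label_, '--', spacy.explain(entity.label_))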

Text Summarization — Task #1

Now that we’ve gone over preprocessing, let’s get into text summarization. It can be very useful for finding out what a long article or document is about, especially when it would take a long time to read or there are several to choose from.

Start by choosing a large text document like a comprehensive news article or blog post.

import spacy
nlp = spacy.load('en_core_web_sm')
article_text = # copy pasted text
# I used "The Power of Natural Language Processing"
# https://hbr.org/2022/04/the-power-of-natural-language-processing
doc = nlp(article_text)

In order to remove noise, we can use parts of speech tagging to selectively store keywords in a list.

from string import punctuation

keywords_list = []
# according to an article I read, nouns, pronouns, and verbs
# generally add the most value to a text
# POS categories that are important
desired_pos = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc:
    if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
        continue
    if(token.pos_ in desired_pos):
        keywords_list.append(token.text)

Now we can conduct word frequency analysis to find the words that are used the most, which will be taken to indicate importance or significance in the article. Then we apply a normalization formula to scale the frequencies into a range of 0 to 1.

from collections import Counter

# creating a dictionary of keywords + frequency
dictionary = Counter(keywords_list)
# finding the highest frequency, which is used to calculate the normalized statistic
highest_frequency = Counter(keywords_list).most_common(1)[0][1]
# normalization formula/process for all keyword frequencies
for word in dictionary:
    dictionary[word] = (dictionary[word] / highest_frequency)
print(dictionary)

Now that we have a value for every word in the article, we can assign a total score to each sentence.

score = {}
# Iteration
for sentence in doc.sents:      # for every sentence in doc
    for token in sentence:      # for every word in the sentence
        # if the token is a keyword, add the frequency of the keyword
        # to the score dictionary
        if token.text in dictionary.keys():
            # if the sentence is already in {score}, add the value to it
            if sentence in score.keys():
                score[sentence] += dictionary[token.text]
            # otherwise the sentence is not in {score}, so add it
            else:
                score[sentence] = dictionary[token.text]
print(score)

The final step is to select a limited number of sentences from a list sorted by score, in order of the significance indicated by those scores.

sorted_score = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
text_summary = []
num_of_sentences = 4
for i in range(num_of_sentences):
    text_summary.append(str(sorted_score[i][0]).capitalize())
print(text_summary)
------------------------------RESULTS-------------------------------
['Language models are already reshaping traditional text analytics, but gpt-3 was an especially pivotal language model because, at 10x larger than any previous model upon release, it was the first large language model, which enabled it to perform even more advanced tasks like programming and solving high school–level math problems.', 'Powerful generalizable language-based ai tools like elicit are here, and they are just the tip of the iceberg; multimodal foundation model-based tools are poised to transform business in ways that are still difficult to predict.', 'Remember that while current ai might not be poised to replace managers, managers who understand ai are poised to replace managers who don’t.\n\ndo not underestimate the transformative potential of ai.\n', 'Nlp practitioners call tools like this “language models,” and they can be used for simple analytics tasks, such as classifying documents and analyzing the sentiment in blocks of text, as well as more advanced tasks, such as answering questions and summarizing reports.']

The results were not bad, not bad at all! There is certainly room for improvement, but keep in mind that this is a very basic NLP text summarization program.

Text Generation — Task #2

Provide a sentence or phrase that an NLP model can expand on with relevant context.

For text generation, we will be using a model called GPT2 which requires certain libraries. Run the commands below in the terminal.

~ pip3 install torch torchvision torchaudio
~ pip install transformers

To get started with the actual code, begin by importing:

import torch
from transformers import GPT2Tokenizer
from transformers import GPT2DoubleHeadsModel
# initialize
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2DoubleHeadsModel.from_pretrained('gpt2-medium')

Store the start of any sentence you want the NLP model to complete or talk about in a variable. The string must be encoded into a sequence of ids.

# My sentence is ...
my_text = "It is a bright"
ids = tokenizer.encode(my_text)
print(ids)

Next, we must get a tensor of the input ids. A tensor is a container that can house data in N dimensions: a type of data structure used in linear algebra that can be used for arithmetic operations, like matrices and vectors.
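As a quick illustration (my own addition, not from the guide), the list of ids becomes a 2-dimensional tensor with one row per input sequence:

import torch  # already imported above
# shape is (batch_size, sequence_length); one row because we pass a single sentence
print(torch.tensor([ids]).shape)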

my_tensor = torch.tensor([ids])
# set the model to evaluation mode
model.eval()
result = model(my_tensor)
predictions = result[0]

predictions stores a tensor with scores for every possible next word of the text. The torch.argmax() method returns the index of the maximum value among all elements in the input tensor, so passing the predictions for the last position to torch.argmax gives us the id of the most likely next word. This id can be decoded by the tokenizer just as the text was encoded by the tokenizer at the start of the program.

predicted_index = torch.argmax(predictions[0, -1, :]).item()
# ids is included so that the entire sentence is shown
predicted_text = tokenizer.decode(ids + [predicted_index])
print(predicted_text)
------------------------------RESULTS-------------------------------
'It is a bright day'

Here is an interesting program, also found in the source mentioned below, that effectively highlights the steps essential to the text generation process.

# generate the required number of words with a loop
num_words_to_generate = 30
# original text
text = 'It is a sunny'
# looping once for each word to be generated
for i in range(num_words_to_generate):
    # encode the input text
    input_ids = tokenizer.encode(text)
    # convert into a tensor
    input_tensor = torch.tensor([input_ids])
    # pass the input tensor to the model
    result = model(input_tensor)
    # storing all predicted scores for the next word
    predictions = result[0]
    # choosing the predicted_index (the maximum value)
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    # decoding the predicted index to text and concatenating it to the
    # original text
    text = tokenizer.decode(input_ids + [predicted_index])
print(str(text))
------------------------------RESULTS-------------------------------
It is a sunny day in the city of San Francisco, and the sun is shining. The city is filled with people, and the people are enjoying themselves. The sun

This was pretty awesome; I am really surprised that it was somehow able to complete the phrases. I don’t really know what’s going on behind the curtains, though. The only way I could imagine creating remotely similar results is by finding instances of phrases such as “It is a sunny [BLANK]” in extremely large amounts of data and returning the word that is used most often, “day”. However, I doubt this is the case because tensors are used, which have something more to do with math, matrices, and vectors.
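To peek behind the curtain a little (this sketch is my own, not from the guide): the model assigns a score to every token in its vocabulary, and the loop above simply takes the single highest one each time. torch.topk shows the runner-up candidates for the next word.

input_ids = tokenizer.encode('It is a sunny')
predictions = model(torch.tensor([input_ids]))[0]
# the five highest-scoring candidates for the next token
top_values, top_indices = torch.topk(predictions[0, -1, :], 5)
for value, index in zip(top_values, top_indices):
    print(repr(tokenizer.decode([index.item()])), '---', round(value.item(), 2))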

Question-Answering — Task #3

This is a task in which a question is asked by a user and the NLP model itself answers the question. No humans involved, except in the coding process of course!

First, import the necessary libraries:

from transformers import pipeline
faq_machine = pipeline(task='question-answering')

Next, it is essential to understand that a question is not the only input. A machine without any data is an empty and useless machine: how can it answer a question when it has no information to draw on? This is why the second input is context, which must be provided to the question-answering program ahead of time. Because of this, questions must be relevant to the context as well; otherwise, the QA program will fail.

about_nlp = # copy pasted informative text
# I used the answer to the question "What is natural language processing?"
# on the official IBM site
# https://www.ibm.com/cloud/learn/natural-language-processing

faq_machine(question='what is natural language processing?', context=about_nlp)
------------------------------RESULTS-------------------------------
{'score': 0.06833501160144806,
'start': 142,
'end': 221,
'answer': 'concerned with giving computers the ability to understand text and spoken words'}

Once again, I am shocked, amazed, left in awe, and everything else possible at the results. It made sure not to include “natural language processing” itself in its definition, used the exact phrase within the text that defined NLP the best, and even cut out some unnecessary parts at the end.

I have a few more test cases I decided to examine.

faq_machine(question='what is an apple?', context=about_nlp)
------------------------------RESULTS-------------------------------
{'score': 0.012518931180238724,
'start': 0,
'end': 27,
'answer': 'Natural language processing'}

This of course confirms that the QA program is not able to provide definitions of information it was not previously given. Also, despite there being nothing in common, it defaulted to the answer “natural language processing”, perhaps because it determined the main subject of the text through word frequency analysis. However, the abbreviation “NLP” is used for most of the rest of the passage, which would counter that idea, so maybe the QA program equates “natural language processing” with “NLP” because the latter appears in parentheses right after the former at the beginning.

--------------------------------CODE--------------------------------
Test case #3
faq_machine(question='what is computatioonal linguistics?', context=about_nlp)
------------------------------RESULTS-------------------------------
{'score': 0.7846049666404724,
'start': 301,
'end': 338,
'answer': 'rule-based modeling of human language'}
--------------------------------CODE--------------------------------
Test case #4
faq_machine(question='what are applications of nlp', context=about_nlp)
------------------------------RESULTS-------------------------------
{'score': 0.029459049925208092,
'start': 923,
'end': 981,
'answer': 'customer service chatbots, and other consumer conveniences'}

If you look closely at the word “computatioonal” in test case #3, the results are still accurate. Apparently misspellings, or at least minor errors, do not hinder the NLP model. Test case #4 is harder because the word “applications” is never mentioned in the context provided to the program, although the applications themselves are present within the data. There were a lot more examples, so I am not completely satisfied with the answer, although the phrase “other consumer conveniences” is not wrong, either.

All in all, I enjoyed learning the basics of NLP. There’s a bunch out there just waiting to be discovered, and my desire to learn more has only strengthened with the little I’ve managed to piece together so far.

Sources:

[CODE TAKEN]



Written by Dev Aggarwal

Tennis player, bookworm, programmer that can't wait to learn more, do more