STEMMING AND LEMMATIZATION.

Sep 19, 2021

Author - PRAJAY MEHTA.

A word can appear in various forms depending on its context and its use in speech. For example, when we search Google for a word like “playing”, the results include pages that contain “play”, whether that is the definition of play, Google Play or a kid’s playground. This is possible because the search engine finds the root word of the inflected word (inflection is explained further on). This is where Stemming and Lemmatization are used.

STEMMING AND LEMMATIZATION:

Stemming and Lemmatization are methods used for Text Normalization in Natural Language Processing (NLP). In layman’s terms, NLP can be defined as the technology machines use to analyze and interpret human language.

Fig-1 NLP.

These processes are an essential part of the NLP pipeline. The pipeline contains various functions, each of which is important for reading, deciphering and understanding human language. The NLP pipeline is as follows:

Fig-2 NLP pipeline.

Before moving on to stemming and lemmatization, there are a few things we need to understand about language.

The language we speak in everyday life is made up of words that are often derived from one another. A language in which words change form as their use in speech changes is known as an inflected language.

Fig-3 Text Normalization.

As we can see from the above image, the words “playing”, “plays” and “played” are derived from the same root word, “play”. This shows inflection in a language. Stemming and Lemmatization are used in NLP to derive the root word, with its proper meaning, from such inflected words in order to achieve Text Normalization.
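
The idea of normalizing inflected forms to one root can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `ROOT_OF` table and the helper names `normalize` and `group_by_root` are hypothetical and written by hand here, whereas a real pipeline would get the roots from a stemmer or lemmatizer.

```python
from collections import defaultdict

# Hypothetical hand-written lookup table (an assumption for illustration);
# a real system would compute roots with a stemmer or lemmatizer.
ROOT_OF = {
    "playing": "play",
    "plays": "play",
    "played": "play",
    "play": "play",
}

def normalize(tokens):
    """Map each token to its root, leaving unknown tokens unchanged."""
    return [ROOT_OF.get(token.lower(), token.lower()) for token in tokens]

def group_by_root(tokens):
    """Collect the surface forms that share a root word."""
    groups = defaultdict(set)
    for token in tokens:
        lowered = token.lower()
        groups[ROOT_OF.get(lowered, lowered)].add(lowered)
    return dict(groups)

tokens = ["Playing", "plays", "played", "play", "ground"]
print(normalize(tokens))
print(group_by_root(tokens))
```

After normalization, all four forms of “play” count as the same item, which is what lets a search or a word count treat them together.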

Stemming and Lemmatization have the same goal but differ in procedure and in results. Generally, Lemmatization produces more meaningful results.

STEMMING:-

Stemming is the process of reducing or removing inflection in words in order to retrieve the root form, known as the stem. Stemming often produces stems that are not actual words, because it just removes suffixes or prefixes added to the word.
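
To see why plain suffix removal can produce stems that are not real words, here is a minimal sketch. It is not the Porter or Lancaster algorithm: the `SUFFIXES` list, the `naive_stem` function and the `min_stem_length` parameter are assumptions made up for this illustration.

```python
# Hypothetical suffix list for illustration only; the Porter and
# Lancaster algorithms use much larger, ordered rule sets.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for word in ["playing", "played", "plays", "troubling", "running"]:
    print(word, "->", naive_stem(word))
```

Here “playing”, “played” and “plays” all reduce to “play”, but “troubling” becomes “troubl” and “running” becomes “runn”, neither of which is a dictionary word.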

To implement Stemming and Lemmatization, we import the nltk (Natural Language Toolkit) Python library, which provides many of the required functions, such as stemming, tokenizing, POS (Part Of Speech) tagging and parsing, and is user friendly.

Generally, PorterStemmer or LancasterStemmer is used for the English language; the algorithm changes with the language.

from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

# Create one instance of each stemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()

# Words to stem with both algorithms
words = ["running", "playing", "friendship", "troubling"]

print("----------------------------------------------------")
print("Porter Stemmer")
for word in words:
    print(porter.stem(word))
print("----------------------------------------------------")
print("Lancaster Stemmer")
for word in words:
    print(lancaster.stem(word))

OUTPUT 1.

Porter Stemming removes the suffix from the word to produce the stem. In the above example, look at the result for “troubling”: the output is “troubl”, not “trouble”. This is because the Porter Stemming algorithm applies suffix-stripping rules without consulting a dictionary. This is why the result of stemming is often not a word with proper meaning.

The Lancaster stemming algorithm is iterative and works from a stored table of rules. During each iteration, it tries to find an applicable rule which tells it to delete or replace the suffix of the word, and it stops when no rule applies, as you can see in the above example for the word “friendship”.
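
The iterative, rule-driven idea described above can be sketched as follows. This is a simplified illustration, not the real Lancaster rule table: the `RULES` entries, the `iterative_stem` function and the `min_stem_length` parameter are all hypothetical.

```python
# Each hypothetical rule is (suffix, replacement, keep_going). These rules
# are made up for illustration and are not the real Lancaster rule table.
RULES = [
    ("ies", "y", False),   # replace the ending, then stop
    ("ship", "", True),    # delete the ending, then try again
    ("ing", "", True),
    ("s", "", True),
]

def iterative_stem(word, min_stem_length=3):
    """Apply the first matching rule repeatedly until none applies."""
    word = word.lower()
    while True:
        for suffix, replacement, keep_going in RULES:
            stem = word[: len(word) - len(suffix)] + replacement
            if word.endswith(suffix) and len(stem) >= min_stem_length:
                word = stem
                break
        else:
            return word  # no rule matched, so stemming is finished
        if not keep_going:
            return word  # the matched rule asked to stop

for word in ["friendships", "ponies", "playing"]:
    print(word, "->", iterative_stem(word))
```

With these rules, “friendships” is first reduced to “friendship” and then to “friend” on the next pass, which shows how repeated rule application can strip more than one ending.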

The nltk library also provides other stemmers, such as SnowballStemmer, which supports several languages in addition to English, and ISRIStemmer for Arabic.
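
As a brief sketch of how a multi-language stemmer is selected, the snippet below uses nltk's SnowballStemmer. It assumes nltk is installed; the example words are arbitrary choices for illustration, and the list of available languages depends on the installed version.

```python
from nltk.stem import SnowballStemmer

# Languages supported by the Snowball stemmer in the installed NLTK version
print(SnowballStemmer.languages)

# The language is chosen by name when the stemmer is created
english = SnowballStemmer("english")
spanish = SnowballStemmer("spanish")

print(english.stem("running"))
print(spanish.stem("corriendo"))
```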

LEMMATIZATION:-

Lemmatization is a linguistic term that means grouping together words with the same root or lemma but with different inflections or derivatives of meaning so they can be analyzed as one item. The aim is to take away inflectional suffixes and prefixes to bring out the word’s dictionary form.

The WordNet Lemmatizer uses the WordNet database to look up lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.

import nltk
from nltk.stem import WordNetLemmatizer

# Download the tokenizer and WordNet data (only needed once).
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')
nltk.download('wordnet')

wnl = WordNetLemmatizer()
sentence = "He was running and eating at the same time. He has a bad habit of swimming after playing for long hours in the Sun."
punctuations = "?:!.,;"

# Tokenize, then drop punctuation tokens. A new list is built because
# removing items from a list while iterating over it can skip tokens.
sentence_words = [word for word in nltk.word_tokenize(sentence)
                  if word not in punctuations]

# Lemmatize every token as a verb (pos="v")
print("{0:24}{1:24}".format("Word", "Lemma"))
for word in sentence_words:
    print("{0:24}{1:24}".format(word, wnl.lemmatize(word, pos="v")))

OUTPUT 2
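
The code above passes pos="v", so every token is looked up as a verb (when no pos is given, WordNetLemmatizer treats the word as a noun). The role of the part of speech in a lookup-based lemmatizer can be sketched with a toy dictionary. The `LEMMAS` table and `toy_lemmatize` function below are hypothetical stand-ins for WordNet, written only for illustration.

```python
# Hypothetical miniature dictionary keyed by (word, part of speech).
# WordNet plays this role in NLTK; these entries are illustrative only.
LEMMAS = {
    ("was", "v"): "be",
    ("has", "v"): "have",
    ("running", "v"): "run",
    ("hours", "n"): "hour",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged."""
    lowered = word.lower()
    return LEMMAS.get((lowered, pos), lowered)

print(toy_lemmatize("running", pos="v"))  # found as a verb
print(toy_lemmatize("running", pos="n"))  # no noun entry, returned unchanged
print(toy_lemmatize("was", pos="v"))      # irregular form found by lookup
print(toy_lemmatize("hours", pos="v"))    # looked up with the wrong POS
```

Because the result comes from a lookup and not from suffix stripping, irregular forms such as “was” can be mapped to “be”, but a word looked up under the wrong part of speech may come back unchanged.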

DIFFERENCE BETWEEN STEMMING AND LEMMATIZATION.

In conclusion, we have seen what Stemming and Lemmatization are and how they differ. They have many other applications, such as text categorization, clustering and spam detection. Implementing these well requires strong linguistic knowledge.

GitHub link :-

https://github.com/Prajay8/Stemming_Lemmatization

Madras Scientific Research Foundation
