Author: Kukkala Naga Vamsi Manikanta Reddy
Digitalization has changed the way we process and analyze information. The amount of data grows every day, from journals, e-books, newspapers, and webpages to emails and beyond. This is where text classification plays a key role. Text classification is the automatic assignment of text to categories such as danger, spam, dislike, sports, politics, and so on. It is one of the fundamental tasks in natural language processing, with broad applications such as sentiment analysis, topic labelling, spam detection, and intent detection.
WHY IS TEXT CLASSIFICATION IMPORTANT?
It’s estimated that around 80% of all information is unstructured, such as videos, audio, and images, with text being one of the most common types of unstructured data. Because of the messy nature of text, analyzing, understanding, organizing, and sorting through text data is hard and time-consuming.
This is where text classification with machine learning comes in. Using text classifiers, we can automatically structure all manner of relevant text, from emails, legal documents, social media, chatbots, and surveys to more, in a fast and cost-effective way. This saves time when analyzing text data and helps automate business processes.
![](https://static.wixstatic.com/media/6e3b57_07b55179d7c44fb0af50246d7a2c299d~mv2.jpg/v1/fill/w_700,h_350,al_c,q_80,enc_auto/6e3b57_07b55179d7c44fb0af50246d7a2c299d~mv2.jpg)
Figure-1
![](https://static.wixstatic.com/media/6e3b57_e6284484d48548d296513dbf42711eec~mv2.jpg/v1/fill/w_980,h_490,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_e6284484d48548d296513dbf42711eec~mv2.jpg)
Figure-2
WHY IS MACHINE LEARNING TEXT CLASSIFICATION IMPORTANT?
SCALABILITY: Analyzing data manually consumes a lot of time. A machine learning algorithm can automatically go through millions of reviews, comments, and emails in a fraction of the time it would take to do so by hand.
REAL-TIME ANALYSIS: When an immediate decision is needed, machine learning helps us act on incoming text quickly.
WORKING PROCESS OF TEXT CLASSIFICATION:
There are three types of text classification systems:
Rule based.
Machine learning based.
Hybrid system.
RULE BASED:
This approach classifies text into categories based on a set of handcrafted rules. These rules instruct the system to use semantically relevant elements of a text to identify relevant categories based on its content.
Example: we want to classify articles as sports or politics based on the words they contain. Words related to sports include football, basketball, LeBron James, etc., and words related to politics include Donald Trump, Hillary Clinton, Putin, etc.
This rule-based system will classify the headline “When is LeBron James' first game with the Lakers?” as Sports because it counts a sports-related term (LeBron James) and no politics-related terms. The main disadvantage of this approach is that it is TIME-CONSUMING: generating rules for a complex system is quite challenging and requires a lot of analysis and time.
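A minimal sketch of such a rule-based classifier in Python, assuming two small, purely illustrative keyword lists (they are not part of the original example):

```python
# A minimal rule-based classifier: count category keywords in the text.
# The keyword lists below are illustrative, not an exhaustive rule set.
SPORTS_TERMS = {"football", "basketball", "lebron james", "lakers", "game"}
POLITICS_TERMS = {"donald trump", "hillary clinton", "putin", "election"}

def classify(headline: str) -> str:
    text = headline.lower()
    sports_hits = sum(term in text for term in SPORTS_TERMS)
    politics_hits = sum(term in text for term in POLITICS_TERMS)
    if sports_hits > politics_hits:
        return "Sports"
    if politics_hits > sports_hits:
        return "Politics"
    return "Unknown"

print(classify("When is LeBron James' first game with the Lakers?"))  # Sports
```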
MACHINE LEARNING BASED:
The first step towards training a machine learning NLP classifier is feature extraction: a method used to transform each text into a numerical representation in the form of a vector. One of the most frequently used approaches is the bag of words, where a vector represents the frequency of each word from a predefined dictionary of words.
Bag of words: A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
A vocabulary of known words.
A measure of the presence of known words.
Step 1: Collect Data:
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
For this small example, let’s treat each line as a separate “document” and the 4 lines as our entire corpus of documents.
Step 2: Design the Vocabulary:
Now we can make a list of all of the words in our model vocabulary.
The unique words here (ignoring case and punctuation) are: “it”, “was”, “the”, “best”, “of”, “times”, “worst”, “age”, “wisdom”, “foolishness”
That is a vocabulary of 10 words from a corpus containing 24 words.
Step 3: Create Document Vectors:
The next step is to score the words in each document.
The simplest scoring method is to mark the presence of words as a Boolean value, 0 for absent, 1 for present.
The scoring of the first document, “It was the best of times”, would look as follows:
“it”=1, “was”=1, “the”=1, “best”=1, “of”=1, “times”=1, “worst”=0, “age”=0, “wisdom”=0, “foolishness”=0
As a vector: "It was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
The remaining documents score as:
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Step 4: Managing Vocabulary:
There are simple text cleaning techniques that can be used as a first step, such as:
Ignoring case.
Ignoring punctuation.
Ignoring frequent words that don’t contain much information, called stop words, like “a,” “of,” etc.
Fixing misspelled words.
Reducing words to their stem (e.g. “play” from “playing”) using stemming algorithms.
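A small Python sketch of these cleaning steps, assuming NLTK is installed for the Porter stemmer; the stop-word set here is a tiny illustrative subset, not a full list:

```python
import string
from nltk.stem import PorterStemmer  # assumes the nltk package is installed

STOP_WORDS = {"a", "an", "the", "of", "it", "was"}  # illustrative subset only
stemmer = PorterStemmer()

def clean(text: str) -> list[str]:
    # Ignore case and punctuation.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Drop stop words and reduce the remaining words to their stems.
    return [stemmer.stem(word) for word in text.split() if word not in STOP_WORDS]

print(clean("It was the best of times,"))  # ['best', 'time']
```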
Step 5: Scoring Words:
Once a vocabulary has been chosen, the occurrence of words in example documents needs to be scored.
In the worked example, we have already seen one very simple approach to scoring: a binary scoring of the presence or absence of words.
Some additional simple scoring methods include:
Counts: Count the number of times each word appears in a document.
Frequencies: Calculate the frequency that each word appears in a document out of all the words in the document.
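A quick sketch of both scoring methods in plain Python, using a made-up example document against the vocabulary from earlier:

```python
from collections import Counter

vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]
document = "it was the best of times it was".split()  # illustrative document

counts = Counter(document)
count_vector = [counts[word] for word in vocabulary]               # raw counts
freq_vector = [counts[word] / len(document) for word in vocabulary]  # counts / document length

print(count_vector)  # [2, 2, 1, 1, 1, 1, 0, 0, 0, 0]
print(freq_vector)   # each count divided by the 8 words in the document
```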
SOME MACHINE LEARNING CLASSIFICATION ALGORITHMS:
Logistic Regression.
Naive Bayes.
K-nearest Neighbors.
LOGISTIC REGRESSION:
Logistic regression is a classification algorithm used to predict a binary outcome, such as YES/NO or PASS/FAIL.
We use independent variables to predict the outcome: the model estimates P(Y = 1 | X),
where Y is the dependent variable and X is the set of independent variables.
It categorizes data into discrete classes by learning the relationship from a given set of labelled data. It learns a linear relationship from the given dataset and then introduces a non-linearity in the form of the sigmoid function.
What is the Sigmoid Function?
The sigmoid is a mathematical function that takes any real value and maps it to a value between 0 and 1; its graph is shaped like the letter “S”. The sigmoid function is also called the logistic function.
![](https://static.wixstatic.com/media/6e3b57_c1374a6264ab4ddeb725ca3e267153b6~mv2.jpg/v1/fill/w_485,h_323,al_c,q_80,enc_auto/6e3b57_c1374a6264ab4ddeb725ca3e267153b6~mv2.jpg)
Figure-3
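A tiny Python sketch of the sigmoid:

```python
import math

def sigmoid(x: float) -> float:
    # Maps any real value into the (0, 1) range: sigmoid(x) = 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-5), sigmoid(0), sigmoid(5))  # ~0.0067, 0.5, ~0.9933
```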
Pros:
Logistic regression is easy to implement, easy to interpret, and very efficient to train.
It makes no assumptions about distributions of classes in feature space.
Cons:
If the number of observations is smaller than the number of features, logistic regression should not be used; otherwise, it may lead to overfitting.
It constructs linear decision boundaries.
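Below is a minimal sketch of logistic regression for text classification with scikit-learn; the pipeline, labels, and tiny training set are illustrative assumptions, not part of the original article:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: 1 = sports, 0 = politics.
texts = [
    "LeBron James scored in the Lakers game",
    "the basketball finals start tonight",
    "the senate passed a new election bill",
    "the president gave a speech on foreign policy",
]
labels = [1, 1, 0, 0]

# Bag-of-words features followed by a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["who won the basketball game"]))        # likely [1] (sports)
print(model.predict_proba(["who won the basketball game"]))  # [P(Y=0|X), P(Y=1|X)]
```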
NAIVE BAYES:
Naive Bayes is a probabilistic classifier based on Bayes’ theorem; it predicts the probability that a piece of data belongs to a certain category.
Example:
![](https://static.wixstatic.com/media/6e3b57_512e915bbd5b4a1984326114f3aa9d41~mv2.jpg/v1/fill/w_770,h_278,al_c,q_80,enc_auto/6e3b57_512e915bbd5b4a1984326114f3aa9d41~mv2.jpg)
Pros:
This algorithm works very fast and can easily predict the class of a test dataset.
You can use it to solve multi-class prediction problems as it’s quite useful with them.
Cons:
This algorithm is also known to be a poor probability estimator, so its probability outputs should not be taken too seriously.
It assumes that all the features are independent. While it might sound great in theory, in real life, you’ll hardly find a set of independent features.
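A rough sketch of Naive Bayes for text classification with scikit-learn's MultinomialNB; the toy spam/ham data is an illustrative assumption:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative spam-detection toy data.
texts = [
    "win a free prize now",
    "limited offer click here to win",
    "meeting rescheduled to monday morning",
    "please review the attached project report",
]
labels = ["spam", "spam", "ham", "ham"]

# Word counts fed into a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["click here to claim your free prize"]))  # likely ['spam']
```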
K - NEAREST NEIGHBORS:
The KNN algorithm assumes that similar things are close to each other.
It can be used for regression as well as classification, but it is mostly used for classification.
K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.
![](https://static.wixstatic.com/media/6e3b57_1e95e1837e8345e4bc67148aeef823c9~mv2.jpg/v1/fill/w_886,h_440,al_c,q_85,enc_auto/6e3b57_1e95e1837e8345e4bc67148aeef823c9~mv2.jpg)
Figure-4
Pros:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Cons:
We always need to determine the value of K, which can be complex at times.
The computation cost is high because distances must be calculated between the new data point and all of the training samples.
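A brief sketch of KNN with scikit-learn; the choice of k = 3 and the toy two-dimensional data are assumptions made for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D feature points: two clusters labelled "cat" and "dog".
X_train = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # "cat" cluster
           [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]]   # "dog" cluster
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]

# k = 3: a new point gets the majority label of its 3 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[1.1, 1.0], [4.8, 5.1]]))  # expected ['cat', 'dog']
```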
HYBRID SYSTEM:
A hybrid system combines a machine-learning-trained base classifier with a rule-based system, used to improve the results. These hybrid systems can easily be fine-tuned by adding specific rules for conflicting tags that haven’t been correctly modeled by the base classifier.
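A minimal sketch of such a hybrid set-up, assuming a scikit-learn-style base classifier with a predict method; the rule and the "billing" tag are purely illustrative:

```python
def hybrid_classify(text: str, model) -> str:
    # Hand-written rule for a tag the base classifier handles poorly (illustrative).
    if "refund" in text.lower():
        return "billing"
    # Otherwise fall back to the machine-learning base classifier.
    return model.predict([text])[0]

# Example usage, assuming `model` is a trained pipeline like the ones above:
# print(hybrid_classify("I would like a refund for my last order", model))
```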
CONCLUSION:
Text classification is a fundamental machine learning problem with applications across various products. In this project, we have broken down the text classification workflow into several steps and, for each step, suggested a suitable approach.
GITHUB:
Check the link above for the code.