Author - Rutuja Shinde
This article focuses on NLP for Log Analysis and Log Mining. Natural Language Processing (NLP) is a branch of computer science and machine learning that deals with training computers to process large amounts of human (natural) language data. Briefly, NLP is the ability of computers to understand human language.
What is a Log?
A log is a time-ordered collection of messages from different network devices and hardware. Logs may be written to files on hard disks or sent over the network as a stream of messages to a log collector. Logs provide the means to maintain and track hardware performance, tune parameters, handle emergencies and system recovery, and optimize applications and infrastructure.
What is Log Analysis?
Log analysis is the process of reviewing, interpreting, and understanding computer-generated records called logs.
More specifically, log analysis extracts information from logs by considering the different syntax and semantics of messages in the log files and interpreting their context within the application, enabling comparative analysis of log files from various sources for anomaly detection and correlation.
What is Log Mining?
Log Mining is also known as Log Knowledge Discovery. It is the process of extracting patterns and correlations from logs to reveal knowledge and detect any anomalies hidden in log messages.
Role of NLP in Log Analysis & Log Mining
Natural language processing techniques are widely used in Log Analysis and Log Mining. Techniques such as Tokenization, Stemming, Lemmatization, and Parsing are used to convert log messages into a structured form. Once the logs are available in this well-structured form, log analysis and log mining are performed to extract useful information, and knowledge is discovered from that information; a typical example is an error log caused by a server failure.
Log Analysis Functions and Methods
Log analysis functions manipulate data to help users organize and extract information from the logs. Here are just a few of the most common methodologies for log analysis.
Normalization - Normalization is a data management technique wherein parts of a message are converted to the same format. The process of centralizing and indexing log data should include a normalization step in which attributes from log entries across applications are standardized and expressed in the same format (see the sketch after this list).
Pattern Recognition - Machine learning applications can now be implemented with log analysis software to compare incoming messages with a pattern book and distinguish between "interesting" and "uninteresting" log messages. Such a system might discard routine log entries, but send an alert when an abnormal entry is detected.
Classification and Tagging - As part of our log analysis, we may want to group together log entries of the same type. We may want to track all of the errors of a certain type across applications, or we may want to filter the data in different ways.
Correlation Analysis - When an event happens, it is likely to be reflected in logs from several different sources. Correlation analysis is the analytical process of gathering log information from a variety of systems and discovering the log entries from each individual system that connect to the known event.
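To make normalization concrete, here is a minimal Python sketch that parses a syslog-style line into standard attributes. The regular expression and the field names are illustrative assumptions; real log formats vary widely and production pipelines handle many of them.

```python
import re

# Illustrative pattern for a syslog-style line; the field names are
# assumptions, not a standard.
SYSLOG_PATTERN = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+\s[\d:]+)\s"   # e.g. "Mar  1 06:25:43"
    r"(?P<host>\S+)\s"
    r"(?P<process>[\w\-/]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)"
)

def normalize(line):
    """Convert a raw log line into a dict of standardized attributes."""
    match = SYSLOG_PATTERN.match(line)
    return match.groupdict() if match else {"message": line}

line = "Mar  1 06:25:43 server1 sshd[1234]: Failed password for root"
print(normalize(line))
# {'timestamp': 'Mar  1 06:25:43', 'host': 'server1',
#  'process': 'sshd', 'pid': '1234',
#  'message': 'Failed password for root'}
```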
Natural Language Processing Techniques
The different methods used for performing log analysis are described below:
1. Pattern Recognition - A technique that involves comparing log messages with messages stored in a pattern book in order to filter them.
2. Text Normalization - Normalization converts different log messages into the same format. This is done when log messages coming from various sources, such as applications or operating systems, use different terminology but carry the same interpretation.
3. Automated Text Classification & Tagging - Classification and tagging of log messages involves ordering messages and tagging them with various keywords for later analysis.
4. Artificial Ignorance - A technique that uses machine learning algorithms to discard uninteresting log messages. It is also used to detect anomalies in the ordinary working of systems; see the sketch below.
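As a minimal sketch of artificial ignorance, the snippet below filters incoming messages against a hypothetical pattern book of routine entries and surfaces only what remains. The patterns and log lines are invented for illustration.

```python
import re

# Hypothetical "pattern book" of routine, uninteresting log messages;
# in practice this list would be learned or curated per system.
ROUTINE_PATTERNS = [
    re.compile(r"session (opened|closed) for user \w+"),
    re.compile(r"health check OK"),
]

def is_interesting(message):
    """Keep only messages matching no known routine pattern."""
    return not any(p.search(message) for p in ROUTINE_PATTERNS)

logs = [
    "sshd: session opened for user alice",
    "disk: I/O error on /dev/sda1",
    "monitor: health check OK",
]
for msg in filter(is_interesting, logs):
    print("ALERT:", msg)   # prints only the disk I/O error
```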
Diving into Natural Language Processing Applications
Natural language processing is a complex field at the intersection of artificial intelligence, computational linguistics, and computer science.
Getting started with Natural Language Processing
1. Tokenization and Sentence Segmentation -
Tokenization is essentially splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
Sentence segmentation identifies sentence boundaries in the given text, i.e., where one sentence ends and another begins. Sentences often end with the punctuation mark ‘.’.
We use the NLTK library in Python for tokenization.
Word tokenize:
We use the word_tokenize() method to split a sentence into tokens or words.
Sentence tokenize:
We use the sent_tokenize() method to split a document or paragraph into sentences.
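A quick sketch with NLTK (assuming the punkt tokenizer models are available; the sample text is invented):

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models, fetched once
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Server restarted. Connection to the host timed out after 30s."

# Split the text into word-level tokens.
print(word_tokenize(text))
# ['Server', 'restarted', '.', 'Connection', 'to', 'the', 'host', ...]

# Split the same text into sentences.
print(sent_tokenize(text))
# ['Server restarted.', 'Connection to the host timed out after 30s.']
```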
2. Text Stemming and Lemmatization -
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a base form. Stemming and lemmatization are special cases of normalization. However, they are different from each other.
Stemming usually refers to a crude heuristic process that chops off the ends of words and often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
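A short NLTK sketch contrasting the two (assuming the WordNet data is available; the example words are arbitrary):

```python
import nltk

nltk.download("wordnet", quiet=True)  # lemmatizer dictionary, fetched once
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes heuristically; lemmatization does a lookup.
print(stemmer.stem("failures"), "/", lemmatizer.lemmatize("failures"))
# failur / failure   (the stem is not a real word; the lemma is)

print(stemmer.stem("running"), "/", lemmatizer.lemmatize("running", "v"))
# run / run          ("v" tells the lemmatizer the word is a verb)
```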
3. Part of speech (POS) tagging -
Parts-of-speech tags are properties of words that define their main context, function, and usage in a sentence. Some of the commonly used part-of-speech tags are:
Nouns: Words that define any object or entity
Verbs: Words that express actions
![](https://static.wixstatic.com/media/6e3b57_16a2ade3e5414735bb3918c491f14125~mv2.jpg/v1/fill/w_363,h_162,al_c,q_80,enc_auto/6e3b57_16a2ade3e5414735bb3918c491f14125~mv2.jpg)
fig: 1
Adjectives and Adverbs: Words that act as modifiers, quantifiers, or intensifiers in a sentence.
Figure 1 shows the common part-of-speech tags.
Part-of-speech tags have a large number of applications, and they are used in a variety of tasks (see the sketch after this list) such as:
Text Cleaning
Feature Engineering tasks
Word sense disambiguation
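A minimal POS-tagging sketch with NLTK (assuming the averaged-perceptron tagger model is available; the sentence is an invented log-like example):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS model

tokens = nltk.word_tokenize("The database server rejected the connection")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('database', 'NN'), ('server', 'NN'),
#  ('rejected', 'VBD'), ('the', 'DT'), ('connection', 'NN')]
```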
4. Parsing -
The word ‘parsing’, which originates from the Latin word ‘pars’ (meaning ‘part’), is used to draw the exact or dictionary meaning from text. It is also called syntactic analysis or syntax analysis. By comparing the text against the rules of a formal grammar, syntax analysis checks it for meaningfulness. A sentence like “Give me hot ice-cream”, for example, would be rejected by the parser or syntactic analyzer.
The outcome of the parsing process is a parse tree like the following, where the sentence is the root; intermediate nodes such as noun_phrase and verb_phrase have children, hence they are called non-terminals; and the leaves of the tree, ‘Tom’, ‘ate’, ‘an’, ‘apple’, are called terminals.
Parse Tree:
![](https://static.wixstatic.com/media/6e3b57_32a53fea9df8431aaf0c8faaa1dd46bb~mv2.jpg/v1/fill/w_314,h_199,al_c,q_80,enc_auto/6e3b57_32a53fea9df8431aaf0c8faaa1dd46bb~mv2.jpg)
fig: 2
Figure 2 shows how the parse tree branches from parent nodes to child nodes.
Regexp parser
Regexp parsing is one of the most widely used parsing techniques. As the name implies, it uses regular expressions, defined in the form of a grammar, on top of a POS-tagged string.
It uses these regular expressions to parse the input sentences and generate a parse tree.
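A minimal sketch using NLTK's RegexpParser. The chunk grammar below is a common textbook noun-phrase pattern, not the only possible grammar, and the exact tags in the output depend on the tagger:

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Illustrative chunk grammar: a noun phrase (NP) is an optional
# determiner, any number of adjectives, then a noun of any kind.
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
parser = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("Tom ate an apple"))
tree = parser.parse(tagged)
print(tree)
# (S (NP Tom/NNP) ate/VBD (NP an/DT apple/NN))  -- approximately
```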
Feature Extraction Techniques
Machine learning algorithms cannot work on raw text directly, so we need feature extraction techniques to convert text into a matrix (or vector) of features.
5. Bag-of-Words -
A bag of words is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and disregard the grammatical details and the word order. It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document. By using the bag-of-words (BoW) technique, we convert a text into its equivalent vector of numbers.
BoW creates a matrix in which the words (terms) are the rows and the documents are the columns, and each cell holds the frequency of the term within the document, ignoring the grammar and order of the terms.
The matrix is referred to as the Term-Document Matrix (TDM). Each row is a word vector; each column is a document vector.
We can use the CountVectorizer class from the scikit-learn library to easily implement the above BoW model in Python, as sketched below.
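A minimal scikit-learn sketch (the two log-like documents are invented; get_feature_names_out() assumes scikit-learn 1.0+, older versions use get_feature_names()). Note that CountVectorizer produces the transpose of the TDM described above: documents as rows, terms as columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "connection failed on server one",
    "connection restored on server one",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['connection' 'failed' 'on' 'one' 'restored' 'server']
print(bow.toarray())
# [[1 1 1 1 0 1]
#  [1 0 1 1 1 1]]
```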
6. N-Grams -
N-grams are an important concept in text analytics. Essentially, an n-gram is a sequence of one or more consecutive items that occur next to each other in a text, where N is a numerical value giving the number of items in the sequence.
When we type text into a search engine, we can see its probabilistic model start predicting the next set of words based on the context. This is known as the autocomplete feature of search engines. N-grams allow us to build this kind of text-mining forecasting model and predict the next words of a text.
For example, given the following review text - “Absolutely wonderful - silky and sexy and comfortable”, we could break it up into:
1-grams: Absolutely, wonderful, silky, and, sexy, and, comfortable
2-grams: Absolutely wonderful, wonderful silky, silky and, and sexy, sexy and, and comfortable
3-grams: Absolutely wonderful silky, wonderful silky and, silky and sexy, and sexy and, sexy and comfortable
N-grams can be more informative than bag of words because they capture more context around each word (e.g., “love this dress” is more informative than just “dress”). Typically, 3-grams are about as high as we want to go, as using higher n-grams beyond that rarely increases performance because of sparsity.
N-grams are used for a variety of things. Some examples include auto completion of sentences (such as the one we see in Gmail these days), auto spell check (yes, we can do that as well), and to a certain extent, we can check for grammar in a given sentence.
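A minimal sketch using NLTK's ngrams() helper on the review text above:

```python
from nltk import ngrams

tokens = "Absolutely wonderful silky and sexy and comfortable".split()

# Generate 1-grams, 2-grams, and 3-grams from the token sequence.
for n in (1, 2, 3):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])
# 1 ['Absolutely', 'wonderful', 'silky', 'and', 'sexy', ...]
# 2 ['Absolutely wonderful', 'wonderful silky', 'silky and', ...]
# 3 ['Absolutely wonderful silky', 'wonderful silky and', ...]
```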
7. Term Frequency-Inverse Document Frequency (TF-IDF) -
TF-IDF is a great statistical measure. It helps us understand the relevance of a term (word) to a document. The matrix is computed by performing the following 3 steps for each term in each document:
1. Calculate the frequency of a term in a document. This is known as Term Frequency (TF). It is computed by dividing the number of times the term appears in the document by the total number of terms in the document.
2. Calculate the Inverse Document Frequency (IDF) of the term. Divide the total number of documents by the number of documents that contain the term, and take the log of the result. The inverse is used so that the log value is positive.
3. Finally, multiply the result of step 1 by the result of step 2. The product is known as TF-IDF.
Rows of the matrix represent the terms and the columns of the matrix are the document names.
![](https://static.wixstatic.com/media/6e3b57_9bb7845f64a341d0a01803ad36b87feb~mv2.png/v1/fill/w_408,h_190,al_c,q_85,enc_auto/6e3b57_9bb7845f64a341d0a01803ad36b87feb~mv2.png)
fig: 3
Figure 3 gives a graphical representation of the steps involved in calculating Term Frequency-Inverse Document Frequency.
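A minimal scikit-learn sketch (the two documents are invented). Note that TfidfVectorizer's default formula adds smoothing and normalization on top of the plain TF × log(N/df) described above, so the exact numbers differ slightly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "disk error on server",
    "disk ok on server",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse TF-IDF matrix

print(vectorizer.get_feature_names_out())
# ['disk' 'error' 'ok' 'on' 'server']
print(tfidf.toarray().round(2))
# Terms shared by both documents ('disk', 'on', 'server') receive
# lower weights than the distinguishing terms 'error' and 'ok'.
```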
Let’s do some coding.
Refer to the notebook linked below for the full code:
https://github.com/Ruz123/NLP-for-Log-Analysis-and-Log-Mining/blob/main/NLP%20Log%20Analysis%20and%20Log%20Mining%20(1).ipynb
Key Application Areas of Natural Language Processing
Apart from its use in Big Data, Log Mining, and Log Analysis, NLP has other significant application areas. Although the term ‘NLP’ may be unfamiliar, we are using NLP every day.
Automatic Text Summarizer
Given an input text, the task is to write a summary of the text, discarding irrelevant points.
Sentiment-based Text Analysis
It is performed on a given text to predict its sentiment, e.g., whether the text conveys a judgment, an opinion, or a review.
Text Classification
It is performed to categorize different journals and news stories according to their domain. Multi-document classification is also possible. A famous example of text classification is spam detection in emails. Based on the style of writing in a journal, its attributes can also be used to detect the author's name.
Information Extraction
Information extraction is what allows an email program to propose adding events to the calendar automatically.