KEYWORD EXTRACTION

Sep 19, 2021 | 3 min read

Author : PRIYATI JAIN

INTRODUCTION

Keyword extraction is a topic in machine learning and artificial intelligence (AI) that draws on Natural Language Processing (NLP), the field concerned with making machines understand and analyze human language. It is a text analysis technique that extracts the most important words from a given document, which saves time by summarizing the content and noting its main points.

A common example: when we buy a product from an online app, we go through the reviews. Those reviews often highlight extracted keywords, which save us time and give us a quick sense of whether the product is good or not.

FIG.1

Keyword extraction matters because about 80% of the data we read or generate is unstructured, which makes it very difficult to go through and analyze. Keyword extraction helps us process and understand that content efficiently, and it saves time.

METHODS

Various methods to extract keywords are discussed below (in technical terms only):

FIG.2

1) TF-IDF(TERM FREQUENCY INVERSE DOCUMENT FREQUENCY)

TF-IDF is a statistical measure of how important a word is to a document. It weighs how frequently the word appears in that document against how rare it is across the whole collection of documents, so it only picks words that carry some importance or meaning in the text. For example, words such as "is", "a" and "the" are used regularly in text but individually carry no significant meaning, which is why they are not extracted. TF-IDF is commonly used to analyze customer feedback.
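The weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code from the linked repository: the stop-word list, the sample reviews, and the function names `tokenize` and `tfidf_keywords` are all assumptions made for the example. It uses the common definitions TF = count / document length and IDF = log(N / document frequency).

```python
import math
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"is", "a", "the", "and", "it", "but"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?;:").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_keywords(docs, doc_index, top_n=3):
    """Rank the words of docs[doc_index] by TF-IDF against the whole collection."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    tokens = tokenized[doc_index]
    if not tokens:
        return []
    counts = Counter(tokens)
    scores = {
        word: (count / len(tokens)) * math.log(len(docs) / df[word])
        for word, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

# Made-up sample reviews, used only to exercise the function.
reviews = [
    "The battery life is great and the battery charges fast.",
    "The screen is great but the speaker is weak.",
    "Delivery was fast and the packaging was great.",
]
print(tfidf_keywords(reviews, 0))  # ['battery', 'charges', 'life']
```

Note that "great" appears in every review, so its IDF is log(3/3) = 0 and it is never picked as a keyword, even though it is not a stop word.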

2) RAKE(RAPID AUTOMATIC KEYWORD EXTRACTION)

RAKE is a well-known keyword extraction algorithm, available in Python through libraries such as rake-nltk. It uses a list of stop words and phrase delimiters to detect the most relevant words and phrases in a piece of text.
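To make the idea concrete, here is a simplified sketch of the RAKE scoring scheme in plain Python. It is not the rake-nltk implementation: the tiny stop-word list, the sample text, and the function names are assumptions for illustration. Candidate phrases are the word runs between stop words and punctuation; each word is scored as degree / frequency, and a phrase scores the sum of its word scores.

```python
import re
from collections import defaultdict

# Hypothetical, very small stop-word list; real setups use a few hundred words.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the"}

def candidate_phrases(text):
    """Split text into phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9\-]*", fragment):
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_keywords(text, top_n=3):
    """Return the top_n (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases containing it
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scores = {
        " ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Made-up sample text for the example.
text = ("Keyword extraction is a text analysis technique. "
        "Keyword extraction finds the important words of a document.")
for phrase, score in rake_keywords(text):
    print(phrase, score)
```

In this sample the longest phrase, "text analysis technique", ranks first with a score of 9.0, which shows how RAKE favors multi-word phrases over single words.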

3) GENSIM

Gensim was primarily developed for topic modeling. Over time, it added other NLP tasks such as summarization and finding text similarity. Here we demonstrate the use of Gensim for keyword extraction.

>>> # Note: gensim.summarization was removed in Gensim 4.0, so this needs gensim < 4.0.
>>> from gensim.summarization import keywords
>>> text = """spaCy is an open-source software library for advanced natural language processing,
written in the programming languages Python and Cython. The library is published under the MIT license
and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion."""
>>> print(keywords(text))
language
languages
software
company

The performance of Gensim in extracting keywords is still not at the level of spaCy and rake-nltk. There is room for improvement in Gensim for the keyword extraction task.

METHOD BASED ON MACHINE LEARNING

Now we will discuss how keywords can be extracted using machine learning. To extract keywords, the machine has to understand the content, which is done by transforming the text into vectors (collections of numbers that encode the data).
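As a small illustration of turning text into vectors, the sketch below builds bag-of-words count vectors in plain Python. This is only one simple encoding, chosen here as an assumption for the example; the sample documents and the function names `build_vocabulary` and `vectorize` are hypothetical.

```python
def build_vocabulary(docs):
    """Map each distinct word to a fixed column index (sorted for reproducibility)."""
    words = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(doc, vocab):
    """Count how often each vocabulary word occurs in doc."""
    vec = [0] * len(vocab)
    for w in doc.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

# Made-up sample documents.
docs = ["good battery good price", "bad battery"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]

print(vocab)    # {'bad': 0, 'battery': 1, 'good': 2, 'price': 3}
print(vectors)  # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Each document becomes a list of numbers of the same length, which is the form a machine learning model can work with.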

4) CRF(CONDITIONAL RANDOM FIELD)

CRF is a statistical approach that learns patterns by weighting different features in a sequence of words present in a text. To evaluate performance, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used: a family of metrics that compare the source text and the extracted text on parameters such as the overlap of words, and that may also take into account the length and number of matching sequences.
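The word-overlap idea behind ROUGE can be sketched as follows. This is a simplified ROUGE-1 (single-word overlap) written for illustration; it skips the stemming and other options of the official ROUGE toolkit, and the function name `rouge_1` and the sample strings are assumptions.

```python
from collections import Counter

def rouge_1(reference, candidate):
    """Return (recall, precision, f1) based on clipped unigram overlap."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return recall, precision, f1

# Made-up example: reference keywords vs. extracted keywords.
reference = "keyword extraction saves reading time"
candidate = "keyword extraction saves time"
recall, precision, f1 = rouge_1(reference, candidate)
print(recall, precision)  # 0.8 1.0
```

Here four of the five reference words were recovered (recall 0.8), and every extracted word was in the reference (precision 1.0).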

FIG.3 - Process of the CRF-based Keyword Extraction

(1) Preprocessing and feature extraction: The input is a document. Before CRF model training, we must transform the document into tagging sequences, i.e. a bag of words or phrases of the document. For a new document, we perform sentence segmentation and POS tagging. Then the features are extracted automatically. The output is a set of feature vectors, where each vector corresponds to a word or phrase.
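The feature extraction step can be sketched as below. This is an illustrative assumption, not the feature set of any particular paper: the feature names are hypothetical, and the input is assumed to be a sentence that has already been POS-tagged by some tagger, given as (word, tag) pairs. A list of feature dictionaries per sentence is the kind of input that CRF toolkits such as sklearn-crfsuite typically accept.

```python
def token_features(tagged, i):
    """Build a feature dictionary for the i-th (word, POS tag) pair."""
    word, pos = tagged[i]
    features = {
        "word.lower": word.lower(),
        "pos": pos,
        "is_title": word.istitle(),
        "length": len(word),
        "position": i / len(tagged),  # relative position in the sentence
    }
    # Context features: POS tag of the neighbouring words.
    if i > 0:
        features["prev.pos"] = tagged[i - 1][1]
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tagged) - 1:
        features["next.pos"] = tagged[i + 1][1]
    else:
        features["EOS"] = True  # end of sentence
    return features

def sentence_features(tagged):
    """One feature dictionary per word, in sentence order."""
    return [token_features(tagged, i) for i in range(len(tagged))]

# Hypothetical pre-tagged sentence (tags assumed to come from a POS tagger).
tagged_sentence = [("Keyword", "NN"), ("extraction", "NN"),
                   ("saves", "VBZ"), ("time", "NN")]
X = sentence_features(tagged_sentence)
print(X[0])
```

For training, each word would also need a label (for example, keyword or not keyword), so that the CRF can learn which feature patterns mark a keyword.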

CONCLUSION

Keyword extraction is effective in topic modeling tasks. We can learn a lot about our text data from only a few keywords. These keywords help you decide whether you want to read an article or not. Keyword extraction gives us the most important keywords from a given text without reading a single line, which saves a lot of time. It is an excellent way to find what is relevant in a large set of data.

GITHUB LINK

1) https://github.com/Priyati1/Keyword-extraction-through-TF-IDF

REFERENCES LINK

1) https://monkeylearn.com/keyword-extraction/

2) https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/

3) https://towardsdatascience.com/keyword-extraction-process-in-python-with-natural-language-processing-nlp-d769a9069d5c

4) https://www.researchgate.net/publication/272372039_Keyword_and_Keyphrase_Extraction_Techniques_A_Literature_Review

Madras Scientific Research Foundation