Named Entity Recognition (NER)

Sep 19, 2021 · 4 min read

Author : Rahul Shelke

The term “Named Entity (NE)”, widely used in Information Extraction (IE), Question Answering (QA) and other Natural Language Processing (NLP) applications, originated in the Message Understanding Conferences (MUC), which influenced IE research in the U.S. in the 1990s. At that time, MUC focused on IE tasks in which structured information about company and defense-related activities was extracted from unstructured text such as newspaper articles. While developing these systems, researchers noticed that it was important to recognize information units such as names (person, organization and location names) and numeric expressions (time, date, money and percent expressions). Extracting these entities came to be recognized as one of the important sub-tasks of IE. Because the task is relatively independent, it has been evaluated separately in several different languages.

[Figure no.1 Named Entity in real-life]

Content:

In this blog, you will discover what Named Entity Recognition is in NLP and how to perform it.

After completing this blog, you will know:

What is a Named Entity?
Why is it important?
Applications of NER.
How to perform Named Entity Recognition?

Let’s get started.

[Figure no.2 Named Entity Recognition in computer]

What is a Named Entity?

Any word or phrase that refers to a specific person, organization, location, etc. is a Named Entity. Named Entity Recognition is a sub-task of Information Extraction: the process of identifying the words in a given text that are named entities. It is also called entity identification or entity chunking.

Example:

“Nevertheless, on 26th September, 2001, Yahoo! stock closed at an all-time low of $8.11”

· Here the named entities are “26th September, 2001” (a date), “Yahoo!” (an organization) and “$8.11” (a money amount). The word “stock” is an ordinary noun, not a named entity.

· Named Entity Recognition is the task of identifying such expressions in the text.
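As a toy illustration of the task, the numeric expressions in the example sentence can be picked out with hand-written patterns. This is only a sketch: the `PATTERNS` table and the `find_entities` helper are hypothetical names made up for this example, the patterns are tuned to this one sentence, and names such as “Yahoo!” need the statistical approaches shown later in this blog.

```python
import re

# Hypothetical hand-written patterns, only meant to fit the example sentence.
PATTERNS = {
    "DATE": re.compile(
        r"\b\d{1,2}(?:st|nd|rd|th)?\s+"
        r"(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December),?\s+\d{4}\b"
    ),
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?"),
}


def find_entities(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by position in the text
    return [(entity, label) for _, entity, label in found]


sentence = ("Nevertheless, on 26th September, 2001, Yahoo! stock closed "
            "at an all-time low of $8.11")
print(find_entities(sentence))
# [('26th September, 2001', 'DATE'), ('$8.11', 'MONEY')]
```

Rules like these break quickly on real text, which is why the rest of this blog relies on trained models.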

Why is it important?

To understand the meaning of a given text (for example, a tweet or a document), it is important to identify who did what to whom. Named Entity Recognition is the first step: it identifies the words that may represent the who, what and whom in the text, and so reveals the major entities the text is talking about. Any NLP task that involves automatically understanding text and acting on it typically needs Named Entity Recognition in its pipeline.

No algorithm can identify all the entities correctly 100% of the time.
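One common way to quantify how far a system is from perfect is to compare its predicted entities with a hand-labelled set using precision, recall and F1. The sketch below assumes entities are compared as exact (text, label) pairs; the `gold` and `predicted` sets are invented for illustration and are not the output of any real system.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted (text, label) pairs against hand-labelled ones."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for the Yahoo! example sentence.
gold = {
    ("Yahoo!", "ORGANIZATION"),
    ("26th September, 2001", "DATE"),
    ("$8.11", "MONEY"),
}
# Hypothetical system output: one entity missed, one false alarm.
predicted = {
    ("Yahoo!", "ORGANIZATION"),
    ("$8.11", "MONEY"),
    ("Nevertheless", "PERSON"),
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```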

Applications of Named Entity Recognition (NER):

Classifying content for news providers: Named Entity Recognition can automatically scan entire articles and reveal the major people, organizations and places discussed in them. Knowing the relevant tags for each article helps in automatically categorizing the articles into defined hierarchies and enables smooth content discovery.
Biomedical data: NER is used extensively on biomedical data to identify genes, DNA, drug names and disease names.
Efficient search algorithms: If the algorithm has to search all the words in millions of articles for every query, the process takes a lot of time. If instead Named Entity Recognition is run once on all the articles and the relevant entities (tags) for each article are stored separately, the search process can be sped up considerably.
Powering content recommendations: One of the major use cases of Named Entity Recognition is automating the recommendation process. Recommendation systems dominate how we discover new content and ideas in today’s world. The example of Netflix shows that an effective recommendation system can work wonders for the fortunes of a media company by making its platform more engaging and even addictive.
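The search idea above can be sketched as a small inverted index that maps each entity to the articles mentioning it. The article IDs, the entity sets and the helper names `build_entity_index` and `search` are all hypothetical; in practice the entity sets would come from running NER once over each article.

```python
from collections import defaultdict

# Hypothetical output of running NER once over three articles.
article_entities = {
    "article-1": {"Apple", "China"},
    "article-2": {"Google", "United States of America"},
    "article-3": {"Apple", "Google"},
}


def build_entity_index(article_entities):
    """Map each lower-cased entity to the set of articles mentioning it."""
    index = defaultdict(set)
    for article_id, entities in article_entities.items():
        for entity in entities:
            index[entity.lower()].add(article_id)
    return index


def search(index, query):
    """Look the query up in the index instead of scanning every article."""
    return sorted(index.get(query.lower(), set()))


index = build_entity_index(article_entities)
print(search(index, "apple"))
# ['article-1', 'article-3']
```

A query then costs one dictionary lookup, however many words the articles contain.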

How to perform Named Entity Recognition?

Approaches:

Basic NLTK approach
Using spaCy

Note: All of the code below was run in a Jupyter Notebook.

Basic NLTK approach

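# importing libraries
import nltk
import pandas as pd

# one-time download of the tokenizer, POS tagger, tag descriptions, NE chunker and word list
# (newer NLTK releases may ask for differently named resources; if a LookupError
# appears, download the resource named in its message)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# data (note the space before the first line continuation, so the two sentences are not glued together)
text = "Apple acquired Zoom in China on Wednesday 6th May 2020. \
This news has made Apple and Google stock jump by 5% on Dow Jones Index in the \
United States of America"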

# **Basic Named Entity (NE) tagging using NLTK - Word based**

#tokenize to words
words = nltk.word_tokenize(text)
words

#Part of speech tagging
pos_tags = nltk.pos_tag(words)
pos_tags

#Check nltk help for description of tag
nltk.help.upenn_tagset('NN')

**ne_chunk**
**binary=True** (chunks are only marked as NE, without an entity type)

chunks = nltk.ne_chunk(pos_tags, binary=True)
for chunk in chunks:
  print(chunk)

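# collect the text and label of every named-entity subtree
entities = []
labels = []
for chunk in chunks:
  # plain (word, tag) tuples have no label; only NE subtrees do
  if hasattr(chunk, 'label'):
    #print(chunk)
    entities.append(' '.join(c[0] for c in chunk))
    labels.append(chunk.label())

# drop duplicates and show the result as a table
# (passing columns= here also works when no entity was found)
entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels, columns=['entities', 'labels'])
entities_df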

**ne_chunk**
**binary=False** (the default: chunks get a type such as PERSON, ORGANIZATION or GPE)

chunks = nltk.ne_chunk(pos_tags)
for chunk in chunks:
  print(chunk)

# same extraction as before, now with typed labels
entities = []
labels = []
for chunk in chunks:
  if hasattr(chunk, 'label'):
    #print(chunk)
    entities.append(' '.join(c[0] for c in chunk))
    labels.append(chunk.label())

entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels, columns=['entities', 'labels'])
entities_df

# **Basic Named Entity (NE) tagging using NLTK - Sentence based**

entities = []
labels = []

# split the text into sentences, then tokenize, tag and chunk each sentence separately
sentences = nltk.sent_tokenize(text)
for sent in sentences:
  for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False):
    if hasattr(chunk, 'label'):
      #print(chunk)
      entities.append(' '.join(c[0] for c in chunk))
      labels.append(chunk.label())

entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels, columns=['entities', 'labels'])
entities_df
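The extraction loop is repeated three times above, so it could be pulled into one helper. The sketch below is an assumption about how that might look, not the author's code: `collect_entities` is a hypothetical name, and `FakeChunk` is a made-up stand-in for an NLTK subtree so that the example runs without downloading any models. With NLTK, the result of `nltk.ne_chunk(pos_tags)` would be passed in directly.

```python
class FakeChunk(list):
    """Hypothetical stand-in for an NLTK subtree:
    a list of (word, tag) pairs that also carries a label."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


def collect_entities(chunks):
    """Return the unique (entity_text, label) pairs in sorted order."""
    pairs = set()
    for chunk in chunks:
        # plain (word, tag) tuples have no label(); entity subtrees do
        if hasattr(chunk, "label"):
            entity = " ".join(word for word, _tag in chunk)
            pairs.add((entity, chunk.label()))
    return sorted(pairs)


# Hand-built input that mimics the shape of a chunked sentence.
chunks = [
    FakeChunk("GPE", [("United", "NNP"), ("States", "NNPS")]),
    ("jump", "VB"),
    FakeChunk("ORGANIZATION", [("Apple", "NNP")]),
    FakeChunk("ORGANIZATION", [("Apple", "NNP")]),  # duplicate mention
]
print(collect_entities(chunks))
# [('Apple', 'ORGANIZATION'), ('United States', 'GPE')]
```

Sorting the pairs also makes the table order repeatable, whereas `list(set(...))` may return rows in a different order from run to run.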

Using spaCy

import pandas as pd
import spacy
from spacy import displacy
# spaCy 2.x brought significant speed and accuracy improvements; check the installed version
spacy.__version__

# Download the small English spaCy model (notebook shell command)
!python -m spacy download en_core_web_sm

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")
# larger English models can be loaded instead, once they are downloaded:
#nlp = spacy.load("en_core_web_md")
#nlp = spacy.load("en_core_web_lg")

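# run the pipeline on the same sample text as in the NLTK section
doc = nlp(text)

entities = []
labels = []
position_start = []
position_end = []

for ent in doc.ents:
    entities.append(ent.text)  # store the string rather than the Span object
    labels.append(ent.label_)
    position_start.append(ent.start_char)  # character offsets into text
    position_end.append(ent.end_char)

df = pd.DataFrame({'Entities':entities,'Labels':labels,'Position_Start':position_start, 'Position_End':position_end})

df

# short description of what a label means
spacy.explain("ORG")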
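The start and end positions collected above are character offsets, so slicing the text from start to end gives back the entity string. The sketch below uses them to mark entities inline. It is plain Python with hand-written spans rather than real model output, and `annotate` and `sample_text` are hypothetical names. It assumes the spans do not overlap, which holds for `doc.ents`.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) span of text as [entity|LABEL].

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])            # text before the entity
        pieces.append(f"[{text[start:end]}|{label}]")
        cursor = end
    pieces.append(text[cursor:])                     # text after the last entity
    return "".join(pieces)


sample_text = "Apple acquired Zoom in China"

# Hand-written labels for illustration, not the output of a model.
spans = []
for entity, label in [("Apple", "ORG"), ("China", "GPE")]:
    start = sample_text.find(entity)
    spans.append((start, start + len(entity), label))

print(annotate(sample_text, spans))
# [Apple|ORG] acquired Zoom in [China|GPE]
```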

GitHub Links:

https://github.com/Rahul404/MSRF_Blogs/blob/main/1.%20Named%20Entity%20Recognition%20(NER)/Basic_NLTK_approache.ipynb
https://github.com/Rahul404/MSRF_Blogs/blob/main/1.%20Named%20Entity%20Recognition%20(NER)/Using_Spacy.ipynb

Summary:

In this blog, you discovered how the term “named entity” came about, what a named entity is, why named entity recognition is important in information extraction (IE), which approaches are available, and how to perform named entity recognition in Python using the code in the GitHub links above.

References:

1. https://en.wikipedia.org/wiki/Named-entity_recognition

2. https://ccc.inaoep.mx/~villasen/index_archivos/cursoTATII/EntidadesNombradas/Sekine-%20NEsHistory04.pdf

Madras Scientific Research Foundation
