
Named Entity Recognition using Deep Learning (ELMo Embedding + Bi-LSTM)

Introduction:

Named-entity recognition (NER), also known as entity identification, entity chunking and entity extraction, is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organisations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

It adds a wealth of semantic knowledge to your content and helps you promptly understand the subject of any given text.

Applications:

A few applications of NER include extracting important named entities from legal, financial and medical documents, classifying content for news providers, and improving search algorithms.

Approaches to tackle this problem:


1. Machine Learning Approach: treat the problem as multi-class classification with the named entities as our labels. The problem here is that, for longer sentences, identifying and labelling named entities requires a thorough understanding of the context of the sentence and of the sequence of word labels in it, which this method ignores; it cannot capture the essence of the entire sentence.

2. Deep Learning Approach: the model best suited to this problem is the Long Short-Term Memory (LSTM) network; specifically, we will use a bi-directional LSTM for our setup. A bi-directional LSTM is a combination of two LSTMs: one runs forward over the sentence from left to right and the other runs backward from right to left, thus capturing the entire essence/context of the sentence. For NER, since the context covers past and future labels in a sequence, we need to take both past and future information into account.
[Figure: Bi-LSTM]

Embedding Layer: ELMo (Embeddings from Language Models). ELMo is a deep contextualised word representation that models both:

(1) complex characteristics of word use (e.g., syntax and semantics), and

(2) how these uses vary across linguistic contexts (i.e., it models polysemy). Example: although the term 'Apple' is common, ELMo will give it different embeddings as a fruit and as an organisation, thanks to its contextual logic.

Example: we also need not worry about out-of-vocabulary (OOV) tokens in the training data, since ELMo builds its representations from characters and will generate an embedding for those as well.

These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.

[Figure: ELMo]

Let's see how we can approach this problem:


1. Data Acquisition: We are going to use a dataset from Kaggle. Please go through the data to learn more about the different tags used (a loading sketch follows below). We have 47,958 sentences in our dataset, 35,179 different words, 42 different POS tags and 17 different named-entity tags.

In this article we will build two different models, for predicting the Tag and the POS respectively.
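Here is a minimal loading sketch. The file name, encoding and column names (Sentence #, Word, POS, Tag) follow the commonly used Kaggle "Entity Annotated Corpus" layout and are assumptions; the repository linked at the end may load the data slightly differently.

```python
import pandas as pd

# Assumed file name, encoding and column layout of the Kaggle
# "Entity Annotated Corpus" (ner_dataset.csv): Sentence #, Word, POS, Tag.
data = pd.read_csv("ner_dataset.csv", encoding="latin1")

# The "Sentence #" column is only filled on the first word of each sentence,
# so forward-fill it before grouping.
data = data.fillna(method="ffill")

words = list(set(data["Word"].values))
tags = list(set(data["Tag"].values))
print(data["Sentence #"].nunique(), len(words), len(tags))
# roughly: 47958 sentences, 35179 words, 17 tags
```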

2. Next we use a class that converts every sentence, together with its named entities (tags), into a list of tuples [(word, named entity), …], as sketched below.
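A minimal sketch of such a class, assuming the pandas DataFrame `data` from the loading step above. Because the dataset also carries POS tags, each tuple here keeps (word, POS, tag); the exact class in the linked repository may differ.

```python
class SentenceGetter(object):
    """Group the flat word-per-row table into one list of (word, POS, tag) tuples per sentence."""

    def __init__(self, data):
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values,
                                                           s["POS"].values,
                                                           s["Tag"].values)]
        self.grouped = data.groupby("Sentence #").apply(agg_func)
        self.sentences = list(self.grouped)

getter = SentenceGetter(data)
sentences = getter.sentences
print(sentences[0][:3])  # e.g. [('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ...]
```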
3. Let's have a look at the distribution of sentence lengths in the dataset. The longest sentence has 140 words, and almost all sentences have fewer than 60 words. Due to hardware constraints we will use a smaller length, i.e. 50 words, which can be processed easily.
4. Let's create word-to-index and index-to-word mappings, which are needed to convert words to indices before training and back again after prediction (see the sketch below).
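A small sketch of these mappings, assuming the `words` and `tags` lists built earlier. Only the tags strictly need a numeric mapping here, since the ELMo layer later consumes raw word strings, but both directions are shown as the step describes.

```python
max_len = 50  # pad/truncate every sentence to 50 tokens (see step 3)

word2idx = {w: i for i, w in enumerate(words)}
idx2word = {i: w for w, i in word2idx.items()}

tag2idx = {t: i for i, t in enumerate(tags)}
idx2tag = {i: t for t, i in tag2idx.items()}
```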
5. From the list of tuples generated earlier, we now build the independent and dependent variable structures.

• Independent variable / word corpus: each sentence becomes a fixed-length sequence of word strings.

• The same applies for the named entities, but this time we need to map our labels to numbers.

6. Train-Test Split (90:10): see the sketch below.
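A sketch of steps 5 and 6 together, assuming `sentences`, `tag2idx` and `max_len` from above. The "__PAD__" padding token and the random seed are illustrative choices, not taken from the original code.

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# X: each sentence as a fixed-length list of word strings (ELMo consumes raw tokens),
# padded with a literal "__PAD__" string.
X = [[w[0] for w in s][:max_len] for s in sentences]
X = [s + ["__PAD__"] * (max_len - len(s)) for s in X]

# y: the tag sequence mapped to integers, padded/truncated with the index of the 'O' tag.
y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post",
                  truncating="post", value=tag2idx["O"])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=2021)
```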

7. Batch Training: since we use a batch size of 32, the network must be fed in chunks whose sizes are all multiples of 32, as in the trimming sketch below.
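One simple way to enforce this is to trim each split to an exact multiple of the batch size; how the original code handles it is an assumption.

```python
batch_size = 32

# The ELMo layer below is built for a fixed batch size, so drop the few
# trailing examples that would leave an incomplete batch.
def trim_to_batch(a, bs=batch_size):
    return a[: (len(a) // bs) * bs]

X_tr, y_tr = trim_to_batch(X_tr), trim_to_batch(y_tr)
X_te, y_te = trim_to_batch(X_te), trim_to_batch(y_te)
```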

8. Loading the ELMo Embedding Layer: we will import TensorFlow Hub (a library for the publication, discovery and consumption of reusable parts of machine learning models) to load the ELMo embedding, wrap it in a function so that we can use it as a layer, and start building our Keras network.

Please downgrade your TensorFlow package to use this code. If you want to do the same in TF 2 or greater, you have to use hub.load(url) and then create a KerasLayer(…, trainable=True).
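A TF 1.x style sketch of loading the ELMo module from TF-Hub and wrapping it so it can sit inside a Keras Lambda layer. The URL is the public ELMo v2 module; the wrapper follows the widely used `hub.Module` pattern rather than the repository's exact code.

```python
import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K

# TF 1.x session setup, matching the "downgrade" note above.
sess = tf.Session()
K.set_session(sess)

elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

def ElmoEmbedding(x):
    """Feed a batch of token strings to the ELMo module and return the contextual vectors."""
    return elmo_model(inputs={"tokens": tf.squeeze(tf.cast(x, tf.string)),
                              "sequence_len": tf.constant(batch_size * [max_len])},
                      signature="tokens", as_dict=True)["elmo"]
```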

9. Designing our Neural Network (a full architecture sketch follows this list):

• Embedding layer (ELMo): we specify the maximum length (50) of the padded sequences. After the network is trained, the embedding layer transforms each token into a vector of n dimensions.

• Bidirectional LSTM: the Bidirectional wrapper takes a recurrent layer (e.g. an LSTM layer) as an argument and consumes the output of the previous embedding layer. We will use two Bi-LSTM layers with a residual connection back to the first Bi-LSTM.

• TimeDistributed layer: we are dealing with a many-to-many RNN architecture, where we expect an output for every input step. For example, in the sequence (a1 → b1, a2 → b2, …, an → bn), a and b are the input and output at every step. The TimeDistributed Dense layer applies the same Dense (fully-connected) operation to the output at every time step, producing the final per-token predictions.
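A sketch of the architecture described above. The layer sizes (512 units) and dropout rates are illustrative assumptions; 1024 is the dimensionality of the ELMo output vectors.

```python
import tensorflow as tf
from keras.models import Model
from keras.layers import Input, Lambda, Bidirectional, LSTM, Dense, TimeDistributed, add

n_tags = len(tags)

input_text = Input(shape=(max_len,), dtype=tf.string)
embedding = Lambda(ElmoEmbedding, output_shape=(max_len, 1024))(input_text)

x = Bidirectional(LSTM(units=512, return_sequences=True,
                       recurrent_dropout=0.2, dropout=0.2))(embedding)
x_rnn = Bidirectional(LSTM(units=512, return_sequences=True,
                           recurrent_dropout=0.2, dropout=0.2))(x)
x = add([x, x_rnn])  # residual connection around the second Bi-LSTM

out = TimeDistributed(Dense(n_tags, activation="softmax"))(x)

model = Model(input_text, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```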
10. Training: we ran this for only 1 epoch since it was taking a lot of time, but the results are awesome.
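Training for a single epoch might look roughly like this; the reshape adds the trailing dimension that sparse categorical cross-entropy expects.

```python
import numpy as np

history = model.fit(np.array(X_tr), y_tr.reshape(len(y_tr), max_len, 1),
                    batch_size=batch_size, epochs=1, verbose=1)
```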

11. Batch Prediction, using the index-to-tag mapping to convert the predicted indices back into tag strings.
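A sketch of batch prediction and the index-to-tag conversion, assuming `idx2tag` from step 4.

```python
import numpy as np

# Predict on the (batch-trimmed) test set, then map the argmax indices back to tag strings.
p = model.predict(np.array(X_te), batch_size=batch_size)
p = np.argmax(p, axis=-1)

pred_tags = [[idx2tag[i] for i in row] for row in p]
true_tags = [[idx2tag[i] for i in row] for row in y_te]
```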
12. Evaluation Metric: in the case of NER we may be dealing with important financial, medical or legal documents, and precise identification of the named entities in those documents determines the success of the model. In other words, false positives and false negatives both have a business cost in an NER task. Therefore, our main metric for evaluating the models is the F1-score, because we need a balance between precision and recall.
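The post does not name an evaluation library, so as one option the entity-level F1 can be computed with seqeval from the tag sequences produced above.

```python
from seqeval.metrics import classification_report, f1_score

print("F1-score: {:.1%}".format(f1_score(true_tags, pred_tags)))
print(classification_report(true_tags, pred_tags))
```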

• We were able to get an F1-score of 81.2%, which is pretty good; the micro, macro and average F1 scores are also good. If you train this for more epochs you will definitely get better results.

13. Comparing our results with spaCy: we can see that our model was able to detect every tag correctly, even after a single epoch.

[Figure: our model's results]

[Figure: spaCy's results]
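For reference, a spaCy baseline can be obtained like this; the model name and the sentence are illustrative, not taken from the post.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United Nations said Tuesday it will send aid to Ethiopia.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```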

14. Part-of-Speech Tagging/Prediction: since we also have part-of-speech (POS) tags in our dataset, we can build a similar model to predict those as well. I have implemented that too and trained it for 1 epoch, and the results were again awesome.

• We were able to get an F1-score of 97.1%, which is pretty good; the micro, macro and average F1 scores are also good.

Comparing our results with spaCy: we can see that our model was able to detect every tag correctly, even after a single epoch.

[Figure: our model's results]

[Figure: spaCy's results]

Thanks for reading this blog. If you liked it, please clap, follow and share.

Where can you find my code?

Github: https://github.com/SubhamIO/Named-Entity-Recognition-using-ELMo-BiLSTM

References:

1. https://jalammar.github.io/illustrated-bert/

2. https://arxiv.org/pdf/1802.05365.pdf

3. https://en.wikipedia.org/wiki/Named-entity_recognition
4. https://allennlp.org/elmo

5. https://sunjackson.github.io/2018/12/11/1ef8909353df3395a36f3f4d3336269b/
