Text Extraction Through Video Lip Reading Using Deep Learning
8th International Conference on System Modeling & Advancement in Research Trends, 22nd–23rd November, 2019
College of Computing Sciences & Information Technology, Teerthanker Mahaveer University, Moradabad, India
S.M. Mazharul Hoque Chowdhury, Mushfiqur Rahman, Marzan Tasnim Oyshi and Md. Arid Hasan
Department of Computer Science & Engineering
Abstract— Automated text extraction from video through lip reading can overcome language barriers and open the door to opportunities in security, connectivity and support for the physically challenged. The conversion is possible by analyzing facial expressions with deep learning methods, but it is a challenging task because varieties of pronunciation and accent cause the same word to produce different facial expressions. In this research, a method for converting video data to text data through lip reading is proposed. The proposed method covers a test dataset, image frame analysis and the generation of text output from the identified words. In the proposed technique, the training dataset is organized by combining all the possible facial expressions of different words.

Keywords: Automated, Audio, Conversion, Frame, Identification, Training Data, Video, Sequence, Word

I. Introduction

Technologies are developing rapidly all over the world. People are introduced to new devices and technological solutions every single day. Cameras are among the most widely used devices, integrating new features and exhibiting their capabilities in different sectors. It is now possible to detect faces, recognize emotion from facial expressions, identify objects and much more. Even medical imaging has made it easy to detect diseases, brain damage, bone fractures, etc. On the other hand, intelligent computer systems can now convert audio data into text data. This can be used to identify suspicious conversations and to assist people who struggle with hearing. Video without audio is one of the most commonly available kinds of data, but retrieving audio from audio-less video is a very difficult task. However, if we can identify and analyze the facial expressions in video data, we will be able to retrieve text from audio-less video. This will help to reduce the crime rate, or we may be able to identify what a person is trying to say who could previously talk but lost his voice due to an accident; for that we can simply convert the audio-less video data into text or speech. In this research we discuss converting this type of video data into text data.

II. Literature Review

A large amount of work has been done on the audio-to-video synchronization problem, but converting audio-less data into audio or text (visual speech recognition) is still a rare case. These works are fundamentally based on concrete methods developed by individual researchers. Zhou et al. [1] surveyed methods in this area in their research, briefly but effectively. Many studies have been carried out in this field, and most follow relatively similar steps: extract the information (features around the lips) and then classify it against a template. Pfister et al. [2] distinguished lip movement by the state of mouth openness using a single SIFT descriptor of the facial region. Pei et al. [3] described the state of the art on many databases, extracting features and then aligning them into motion patterns. Koller et al. [4] used a deep convolutional neural network to extract sign language from moving mouth shapes. Similarly, Zoric et al. [5] encoded the sample images, split them into frames, trained on them and then classified them to produce a word-level classification. Chen [6] presented a speech-assisted frame-rate conversion method for speech-assisted video encoding. Lavagetto [7] developed a multimedia telephone for hard-of-hearing persons, turning the conversation into graphic animation suited to lip reading.

Word classification with a bag of words or lexicon has not been attempted in visual speech recognition; however, [8] has tackled the same problem within the context of text recognition. Their work shows that it is
This system is entirely dependent on the training data set provided to it; the more data it is given, the more precise it becomes. Depending on the angle or position of the speaker's face, the system may produce different output, because from different angles the same movement can look different. So, for a particular word, multiple training samples are required from different angles. Moreover, the way a word is spoken varies from person to person, so it is also important to take training data from multiple users. As there will be an abundance of training data for a single word, the total amount of training data will be high as well, but a larger amount of data will give us higher precision.
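As a rough illustration of how such a training set could be organized, the sketch below indexes frame sequences under each word, keyed by speaker and camera angle. The directory layout and the build_training_index helper are assumptions made for this example, not details taken from the paper.

```python
import os
from collections import defaultdict

def build_training_index(root_dir):
    """Index training clips as word -> list of {speaker, angle, frames}.

    Assumes a hypothetical layout root_dir/<word>/<speaker>/<angle>/*.png,
    so every word keeps samples from several speakers and camera angles.
    """
    index = defaultdict(list)
    for word in sorted(os.listdir(root_dir)):
        word_dir = os.path.join(root_dir, word)
        if not os.path.isdir(word_dir):
            continue
        for speaker in sorted(os.listdir(word_dir)):
            speaker_dir = os.path.join(word_dir, speaker)
            for angle in sorted(os.listdir(speaker_dir)):
                angle_dir = os.path.join(speaker_dir, angle)
                frames = sorted(
                    os.path.join(angle_dir, f)
                    for f in os.listdir(angle_dir)
                    if f.endswith(".png")
                )
                if frames:  # ignore empty folders
                    index[word].append(
                        {"speaker": speaker, "angle": angle, "frames": frames}
                    )
    return index

# Example: more speakers and angles per word should mean higher precision.
# index = build_training_index("lip_dataset/train")
# print({word: len(samples) for word, samples in index.items()})
```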
After building the training data set with enough items, each pairing a word with its video frames, the system will be able to deal with the test data. To convert voiceless video data into text data, it is essential to build a system that can handle video data as well as text data. For that, the system needs to map the user's lips and track how they change while the user is speaking about something. The system will first break the video into image sequences.
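One common way to map the lips frame by frame, as described above, is to track mouth landmarks. The sketch below is an assumed implementation using dlib's 68-point face landmark predictor; the model file name is an assumption and the paper does not prescribe any particular landmark detector. It returns the 20 mouth points of the first detected face in a frame.

```python
import cv2
import dlib

# Assumed external model file; the paper does not specify a landmark detector.
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

def mouth_landmarks(frame_bgr):
    """Return the 20 mouth landmark (x, y) points of the first detected face,
    or None if no face is found. Points 48-67 of the 68-point dlib model
    outline the outer and inner lip contours."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]
```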
Every person takes a minimum amount of time between uttering one word and the next, so to identify a word it is important to identify the time a person takes. When someone speaks they take tiny breaks, which may last only a fraction of a second. Someone may take longer than that, but it will not change anything in the system, because the minimum time will not change. The longer the break someone takes, the more accurately the system will be able to identify a word.
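A minimal sketch of this splitting step, assuming OpenCV and the mouth_landmarks helper sketched above: the video is read frame by frame, lip opening is measured from the inner-lip landmarks, and a new word segment is started whenever the mouth stays nearly closed for a short run of frames. The threshold and gap length are illustrative guesses, not values reported in the paper.

```python
import cv2

def split_into_word_segments(video_path, open_thresh=6.0, min_gap_frames=3):
    """Split a video into lists of frames, one list per suspected word.

    Lip opening is approximated by the vertical distance between inner-lip
    landmarks 62 and 66 of the 68-point model. Frames where the mouth stays
    nearly closed for min_gap_frames in a row are treated as the pause
    between words and are dropped, mirroring the gap removal described here.
    """
    cap = cv2.VideoCapture(video_path)
    segments, current, closed_run = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pts = mouth_landmarks(frame)          # from the earlier sketch
        if pts is None:
            continue                          # no face found: skip the frame
        opening = abs(pts[66 - 48][1] - pts[62 - 48][1])
        if opening < open_thresh:
            closed_run += 1
            if closed_run >= min_gap_frames and current:
                segments.append(current)      # pause found: close the segment
                current = []
        else:
            closed_run = 0
            current.append(frame)
    cap.release()
    if current:
        segments.append(current)
    return segments
```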
After breaking the test video data into image sequences like TD-X in fig. 2, the system will split them according to the time gap between speaking one word and the next. During this process the system will also eliminate the frames that fall within this time gap, because in the training data all blank or extraneous frames were likewise removed to avoid any error or mismatch of the word. When the splitting is done as in fig. 2, the system is ready to analyze the test data. As mentioned before, the system is already trained, so it will now match every image sequence from the test data set against the training database. For each match the system will generate a word and store it in a text file, then move on to matching the next sequence.
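The matching-and-output loop could look like the sketch below. The sequence_signature comparison is a deliberately simple stand-in (downsampled grayscale frames compared by mean squared difference), not the paper's trained deep model; it only illustrates looking each test sequence up against the training index and appending the best-matching word, in order, to the output text file.

```python
import cv2
import numpy as np

def sequence_signature(frames, size=(32, 32), n_steps=8):
    """Fixed-length signature of a frame sequence: sample n_steps frames
    evenly, convert to grayscale and downscale. A placeholder for the
    features a real deep model would learn."""
    idx = np.linspace(0, len(frames) - 1, n_steps).astype(int)
    sig = [cv2.resize(cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY), size)
           for i in idx]
    return np.stack(sig).astype(np.float32) / 255.0

def transcribe(segments, training_index, out_path="output.txt"):
    """Match every test segment against the training database and write the
    recognized words to a text file in the order they occur in the video."""
    # One signature per stored training sample.
    templates = []
    for word, samples in training_index.items():
        for sample in samples:
            frames = [cv2.imread(p) for p in sample["frames"]]
            templates.append((word, sequence_signature(frames)))

    words = []
    for seg in segments:
        sig = sequence_signature(seg)
        # Nearest template by mean squared difference (placeholder metric).
        best_word = min(templates,
                        key=lambda t: float(np.mean((t[1] - sig) ** 2)))[0]
        words.append(best_word)

    with open(out_path, "w") as f:
        f.write(" ".join(words))
    return words
```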
In this system grammar is not the main priority; the main focus will be on determining what someone is saying, because from the basic text structure it is quite possible to infer what the speaker is talking about. In the text file all the words must be placed in order according to their sequence in the video, because if the order is broken it can change the entire meaning of the sentence. When all the sequences have been analyzed and their corresponding words placed in the text file, the process will finish and a text file will be produced as output.

As this system will identify words according to the movement of the lips, it will be possible to detect any kind of word from any language. Even emotions can be predicted from the face along with the word, so it will be easier to identify what the speaker actually means by the sentence. This system has enormous uses in real life. However, it also has some drawbacks and complexities. The huge amount of data cannot be stored in our traditional databases, and if it is not handled carefully the data will not be able to provide good results after analysis. There is no fixed limit on how much training data of this type can be added for better precision.

This work can be compared with the Lip-Interact technique of Ke Sun's team, which allows interaction between a smart device and a human [9]. The main similarity with our research is that it involves language processing from the lip movement of the user. In that project, Lip-Interact collects silent commands from its user, and the model is trained on some fixed inputs such as do, undo, screenshot, open camera and so on. The front camera of the smart device captures the input while the user looks at the screen and says something. To improve the quality of the command recognition they used Spatial Transformer Networks. Their training model used real-time user inputs as the data set instead of a random data set or news data. This can be a good solution for a small number of words, because complexity is reduced: the possibility of inaccurate data is very low and noisy data is already preprocessed. Therefore, the amount of training data stays in a range that can easily be handled. For a general conversation, however, it will be difficult, though our research is still in progress to improve the analysis.
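For reference, a spatial transformer block of the kind mentioned above typically looks like the PyTorch sketch below. This is a generic, textbook-style module assumed for illustration (input size, channel counts and layer sizes are arbitrary choices), not the actual Lip-Interact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Generic spatial transformer: predicts an affine transform from the
    input image and resamples it, letting a network straighten or crop the
    mouth region before classification."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(True),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(True),
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 12 * 12, 32), nn.ReLU(True), nn.Linear(32, 6)
        )
        # Start from the identity transform.
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):  # x: (N, in_channels, 64, 64) assumed
        xs = self.localization(x)
        theta = self.fc_loc(xs.flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```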
An example of the data training can be explained through the sample given below in figure 3.

Fig. 3: Data training for word recognition

The given sample, collected from the Lip-Interact research, shows that the camera, which works as a sensor in that research, acts as a lip-gesture identifier, taking a snapshot of every movement from the video, with a time-gap condition determining after how much time each image is taken. Figure 4 shows the lip movement while a user is saying something, taken from the research of Ahmad B. A. Hassanat on automated lip reading [18]. Each spoken word has its own type of movement