Deep Learning For Image Processing Applications
Advances in Parallel Computing
This book series publishes research and development results on all aspects of parallel computing.
Topics may include one or more of the following: high speed (HPC) and high throughput
computing (HTC) architectures, including grids, clouds, clusters, Service Oriented Architectures,
etc., network technology, performance measurement, system software, middleware, algorithm
design, development tools, software engineering, services and applications from all scientific and
engineering fields, including data science and analytics.
Series Editor:
Professor Dr. Gerhard R. Joubert
Volume 31
Recently published in this series
Vol. 30. G. Fox, V. Getov, L. Grandinetti, G. Joubert and T. Sterling (Eds.), New Frontiers in
High Performance Computing and Big Data
Vol. 29. M. Mittal, D. Jude Hemanth, V.E. Balas and R. Kumar (Eds.), Data Intensive
Computing Application for Big Data
Vol. 28. C. Trinitis and J. Weidendorfer (Eds.), Co-Scheduling of HPC Applications
Vol. 27. G.R. Joubert, H. Leather, M. Parsons, F. Peters and M. Sawyer (Eds.), Parallel
Computing: On the Road to Exascale
Vol. 26. L. Grandinetti, G. Joubert, M. Kunze and V. Pascucci (Eds.), Big Data and High
Performance Computing
Vol. 25. M. Bader, A. Bode, H.-J. Bungartz, M. Gerndt, G.R. Joubert and F. Peters (Eds.),
Parallel Computing: Accelerating Computational Science and Engineering (CSE)
Vol. 24. E.H. D’Hollander, J.J. Dongarra, I.T. Foster, L. Grandinetti and G.R. Joubert (Eds.),
Transition of HPC Towards Exascale Computing
Vol. 23. C. Catlett, W. Gentzsch, L. Grandinetti, G. Joubert and J.L. Vazquez-Poletti (Eds.),
Cloud Computing and Big Data
Vol. 22. K. De Bosschere, E.H. D’Hollander, G.R. Joubert, D. Padua and F. Peters (Eds.),
Applications, Tools and Techniques on the Road to Exascale Computing
Vol. 21. J. Kowalik and T. Puźniakowski, Using OpenCL – Programming Massively Parallel
Computers
Vol. 20. I. Foster, W. Gentzsch, L. Grandinetti and G.R. Joubert (Eds.), High Performance
Computing: From Grids and Clouds to Exascale
Vol. 19. B. Chapman, F. Desprez, G.R. Joubert, A. Lichnewsky, F. Peters and T. Priol (Eds.),
Parallel Computing: From Multicores and GPU’s to Petascale
Edited by
D. Jude Hemanth
Karunya University, India
and
Vania Vieira Estrela
Universidade Federal Fluminense, Brazil
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, without prior written permission from the publisher.
Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: order@iospress.nl
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.
Preface
Deep learning and image processing are two areas of great interest to academics and industry professionals alike. The main objective of this book is to bring concepts from these two areas together on the same platform. Professionals from academia and research labs have shared ideas, problems and solutions relating to the multifaceted aspects of these areas.
The first chapter provides an interesting introduction to deep learning by examining the relation between man, mind and intelligence, and lays an excellent foundation for the subsequent chapters. The second chapter demonstrates the application of deep neural networks to image classification; a wide range of images is used in this application, demonstrating the robustness of the proposed approach. Hand gesture recognition with deep neural networks is analyzed in the third chapter, an interesting aspect of which is the explanation of how recognized hand gestures are used to control a robotic arm.
Deep learning techniques for image retrieval are discussed in the fourth chapter, which also stresses the significance of the growth of multimedia data in real time and the necessity for efficient search processes. The fifth chapter concentrates on disease detection from human images using deep learning techniques; the sample disease used in this approach is a form of diabetes commonly found in humans. The sixth chapter deals with the application of deep learning approaches to tuberculosis detection in the human body, with experimental results showing promise for the proposed technique.
Object retrieval from images using deep convolutional features is discussed in the seventh chapter, where convolutional neural networks are used for the experimental analysis. The eighth chapter highlights hierarchical object detection with deep reinforcement learning approaches using different variations of the images. A comparative analysis of deep data and big data is presented in the ninth chapter, which adds a different dimension to the preceding content.
Vehicle type recognition using sparse filtered convolutional neural networks is discussed in the tenth chapter, with images from a publicly available database used for the experimental analysis. The application of deep learning approaches to surveillance and security is discussed in the eleventh chapter, and the final chapter discusses the possibility of enhancing the quality of images captured from a long distance using deep learning approaches. The variety of content in these chapters provides an excellent platform for researchers working in these areas.
We would like to express our gratitude to all of the contributing authors, and to acknowledge the great efforts of the reviewers who spent their valuable time working on the contents of this book. We would also like to thank Prof. Gerhard Joubert, Editor-in-Chief of the Advances in Parallel Computing series, and IOS Press for their constant guidance throughout this book project.
D. Jude Hemanth
Vania Vieira Estrela
Contents
Preface
D. Jude Hemanth and Vania Vieira Estrela
About the Editors
1. Introduction
Vision can be considered the most formidable of the human senses. Indeed, the majority of our problem-solving abilities emanate from vision: it provides the ability to mentally simulate the effects of actions. AI researchers have long tried to create vision systems in machines, since a vision system augments the various abilities of a machine, including its ability to process images. A human being is a
biologically intelligent agent, and it is intuitive to ask: "can machines process images like the human mind?" Numerous considerations bear on this question. Within the realm of classical artificial intelligence and cognitive science it is natural to think that cognition can be reproduced by machines, and cognition plays a vital role in image processing. It is worth noting that animal intelligence displays extremely effective functioning in image processing. Animals live in an open world, whereas researchers are still trying to overcome the closed-world hypothesis of the machine. In fact, many simple animals such as insects, fish and frogs exhibit intelligent behavior yet have virtually no brain; they must therefore be doing something that avoids the frame problem. Further, it is often assumed that sensors give an accurate representation of the world. However, this is frequently not true: sensors are always noisy and have vulnerabilities. Therefore, it is clear that images are processed in the presence of uncertainty.
In fact, in different streams of science, e.g. cognition, computation, psychology, philosophy, linguistics, anthropology and neuroscience, the term intelligence has varying implications. The word intelligence is supposed to originate from the Latin word "intelligere" and is also associated with the Greek philosophical term "nous". The meaning of "intelligere" is to perceive or comprehend. In the medieval period, the word "intellectus" was used for understanding. In the past, these terms served as an impetus for explaining metaphysical and cosmological theories of the immortality of the soul. The common definition of intelligence assimilates the different abilities to acquire, understand and apply knowledge. It can also be considered the ability to exercise thought and reasoning, and it is an amalgamation of thought, imagination, sight, and sound perception. Precisely, we can say that intelligence encompasses the knowledge and facts acquired during conscious and unconscious states. Further, one can express and feel emotions, and perhaps much more, through intelligence, and it is acquired through study and experience.
Philosophically, intelligence displays a dual attribute with mind. It is subtler than the mind and at the same time acts as a substrate of the mind, engaging it in consciousness, which in turn further strengthens the mind together with the senses. Rationality and irrationality, and the rightness and wrongness of actions, are differentiated by means of intelligence. Intelligence is also pertinent to the brain, which is the substrate of mind; the brain can be considered analogous to the hardware of a computing machine. Moreover, the hierarchical regions of the brain possess the ability to learn as well as to predict their future input sequences. The hierarchy of intelligence, mind, working senses, and dull matter is shown in Figure 1.
There are many implicit aspects related to mind, yet insight into them is still obscure. In [1], the modern version of identity theory was proposed by Herbert Feigl. Further, it is natural to enquire about the relationship between mind, brain, and body. When we discuss the theory of mind philosophically, this relationship and other implicit aspects of mind are of paramount interest and deserve discussion, and these questions generate many further questions. Indeed, mind has always been a topic of interest for philosophers, and different inquisitive questions related to mind still need philosophical explanation. In [2], an attempt has been made to elucidate questions such as: what are mind and brain, how are they related to the body, and how do they affect the physical world? Moreover, [2] raises the fundamental question of whether the mind resides inside the head or in the environment. Logically, it is hard to know about another's mind, whether it is a living being or a machine, and it is also not easy to obtain true recognition of the self.
In [2], a substantial attempt has been made to elucidate all these inquisitive questions. An extensive literature survey reveals that views on these issues can be categorized into different domains. These are:
• Dualism
• Materialism
• Idealism
• Functionalism
• Anomalous monism
According to the canon of dualism, mind, brain and body are distinct entities; there is a dichotomy between mental and physical states, although mind, brain and body are associated in some fashion. In contrast to dualism, materialists consider the mind a physical entity and believe that mental states are derivative of physical states. Idealism contradicts the materialist theory and holds that physical states are derivative of mental states. Further, in [3] a theory was proposed against dualism. Functionalism stems from behaviorism and the identity theory; it states that mental states are ultimately functional states, which play particular roles within the cognitive system. Functionalism was developed in the 1960s by David Armstrong [4]. Furthermore, the anomalous monism view connects mental events with physical events, while emphasizing that there are no strict laws governing the mental events themselves. In [5], it is argued that, in general, the high-level concept of an entity cannot be captured in a low-level concept of the same entity. This is the reason why, when three-dimensional images are transformed into two dimensions, considerable information loss occurs; if one tries to capture high-level concepts at a low level, the result becomes complex and arbitrary. This idea is particularly pertinent in the domain of psychology. Within the purview of metaphysics, mental causation and the way the mind influences the physical world are the central points of consideration, and this issue is considered a tangible difficulty for any form of dualism and materialism [6].
The mind and consciousness are two different entities. We can perceive that consciousness connects the mind with the body, as shown in Figure 1. In fact, a conscious system has to be aware as well as intelligent; it is a common phenomenon that an unconscious state cannot convey the pain or pleasure of the body to the mind, and vice versa. In [2], different phenomena pertaining to consciousness are discussed, such as what consciousness is and whether a purely physical system could be conscious. Further questions naturally arise: is it possible to explain subjective experience in objective terms, and how does the mind represent the world? These issues, and the nature of belief and desire, are also elucidated in [2]. Indeed, philosophy deals substantially and extensively with the intricacies of the human mind, and oriental philosophy in particular tries to establish the correlation between the thought process, the human being's well-being, and the mind.
The rest of the chapter is organized as follows: Section 2 gives a brief description of artificial intelligence (AI). Section 3 presents a brief literature survey of image processing. Section 4 attempts to analyze the different aspects of mind, machine and images. Further, Section 5 proposes different reasons which make image processing difficult for machines.
2. Artificial Intelligence
It would be appropriate to define the term "intelligence" before defining artificial intelligence. In [7], a profound definition of intelligence is given; however, in this definition the relationship between the intelligent agent and its environment is missing. It is obvious that an agent will be influenced by the external world and that the agent's actions will also influence its environment. From Dreyfus' point of view, this is the 'situatedness' of intelligence.
Indeed, it is not easy to define the term "artificial intelligence" (AI). Perhaps it is a weird philosophical question, and a systematic answer would require the foundational details of AI.
Different definitions of AI raise many new questions, such as: what are intelligent behavior and its characteristics, what exactly is the definition of mind, and what are the essential components of mind? It is also natural to ask how to develop a mind in an artificial agent for various purposes, including image processing, and how an artificial agent might mimic the way human beings manage to behave intelligently. The nature of this last question is not deterministic; it is empirical, and answering it requires the canon of psychology and cognitive science. Further, it is obvious that without insight into the human thought process it would be difficult to build an intelligent machine. As a matter of fact, a deep philosophical contemplation of all these questions will certainly influence the course of AI.
In fact, there is no globally accepted definition of artificial intelligence. Perhaps this is the reason researchers declare that their work belongs to the domain of "intelligent systems" or "knowledge-based systems". However, AI can be considered the science of making machines act in ways that would usually be called intelligent. In [8], it is defined as "the study of ideas that enable computers to be intelligent". Further, in [9, 10], AI is defined in a simple and elegant manner.
Some researchers believe it possible that machine cognition can perform many tasks more efficiently than human intelligence, including the processing of images. Other researchers do not believe this, and state that there is a need to reset and redefine the philosophy and theories of AI as well as the relationship between AI and cognitive science. Indeed, there is a need to redesign a modified version of AI from the realm of technical AI. Furthermore, some researchers also emphasize replacing classical cognitive science with embodied cognition theory. We should also incorporate innovative approaches to AI such as neuroscience and systems theory, and undoubtedly this will require reconsideration of basic questions related to computing, cognition, and the ethics of AI.
2.1. History of AI
Cognitive science combines computer models from the domain of AI with experimental techniques from psychology, and thus constructs pragmatic theories about the working of the human mind. Many ancient philosophers considered intelligence a rational part of the mind. Furthermore, the concept of logic as a science of reasoning was commenced by Aristotle; it yielded the conclusion of a problem by proceeding step by step from a fixed set of propositions. Later, Rene Descartes (1596-1650), a French philosopher and mathematician, pointed out a problem purely related to the physical and mechanical conception of mind. The development of logic paved the way to understanding the functioning of the mind mathematically. Subsequently, George Boole (1815-1864) provided the mathematical foundation of symbolic logic for Aristotle's system of reasoning as we understand it today. Further, Kenneth Craik (1914-1945) developed the domain of cognitive psychology.
The notions underlying today's computers were formalized by Alan Turing (1912-1954), who also rendered the mathematical characteristics of the functioning of a computer program. Furthermore, the substantial concept of decision theory was proposed by Von Neumann (1903-1957); indeed, decision theory was the precursor of a new theory distinguishing good and bad actions, as was hypothesized by Socrates. In [11], McCulloch and Pitts showed how neurons might work in the brain; they used an electrical circuit to model a simple neural network, and thus the sub-field of neural computing commenced its use in AI.
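As an aside, the kind of threshold unit McCulloch and Pitts proposed can be illustrated in a few lines of code. The following Python sketch is a minimal illustration only; the weights and thresholds are assumptions chosen for the example, not taken from [11]. It realises logical AND and OR as neurons that fire when a weighted sum of binary inputs reaches a threshold:

def mp_unit(inputs, weights, threshold):
    # Fire (output 1) when the weighted sum of binary inputs reaches the threshold.
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

def gate_and(x1, x2):
    return mp_unit([x1, x2], [1, 1], threshold=2)

def gate_or(x1, x2):
    return mp_unit([x1, x2], [1, 1], threshold=1)

print(gate_and(1, 1), gate_and(1, 0))   # 1 0
print(gate_or(0, 1), gate_or(0, 0))     # 1 0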
Norbert Wiener compared different characteristics of humans and machines, including the study of communication in humans and machines, from which cybernetics emerged as an active area of research [12]. Cybernetics hybridized concepts from information theory, feedback control systems (both biological and mechanical), and electronic computers. Further, the developments made in formal grammars and logic during the early 1900s helped to provide new approaches to language theories. Several prototype systems were also developed, including the Mark I Harvard relay computer (1944), the University of Pennsylvania Moore School of Electrical Engineering's ENIAC electronic computer (1947), and the subsequent development of the Aberdeen Proving Ground's EDVAC and Sperry-Rand's UNIVAC.
In this connection, it is fair to say that the introduction of information theory, mainly due to the work of Claude Shannon [13], together with neurological theories and models of the brain originated by psychologists, as well as the introduction of Boolean algebra, switching theory etc., further augmented the efficacy of AI. In fact, AI is not just the outcome of this century; much groundwork had been laid by earlier researchers dating back several hundred years, including Aristotle, Leibniz, Babbage, Hollerith and many others, who played significant roles in building the foundation that eventually led to what we now know as AI. One of the first programs for automatic theorem proving was completed by Newell, Shaw, and Simon [14]. Further, Newell and Shaw developed programming for the logic theory machine [15]; a consequence of this development was the completion of the processing language IPL (Information Processing Language). Moreover, the development of FORTRAN (begun in 1954) and Noam Chomsky's work had a strong impact on AI.
Pattern recognition and self-adapting systems were central areas of interest for researchers during this period, and Rosenblatt's perceptron was receiving much attention [16]. Perceptrons are a type of pattern recognition device with a simple learning ability based on linear threshold logic. Further, McCarthy developed the LISP programming language [17], which became the recognized programming language of AI.
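To illustrate the linear threshold logic and simple learning ability mentioned above, the following Python sketch implements the classical perceptron learning rule on a toy, linearly separable problem. The data, learning rate, and epoch count are illustrative assumptions, not taken from [16]:

import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else 0
            err = target - pred
            w += lr * err * xi          # adjust weights only when the unit errs
            b += lr * err
    return w, b

# Learn the logical OR function as a simple linearly separable pattern.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w, b = train_perceptron(X, y)
print([1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])   # expected [0, 1, 1, 1]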
2.2. Approaches of AI
• Acting humanly: the Turing test approach [22]. If a machine passes Alan Turing's (1912-1954) "imitation game" it is said to be intelligent. In this game, the interrogator and the computer communicate with each other by textual messages; if the interrogator cannot detect by questioning that a computer is in the next room, then we can consider that computer intelligent.
• Thinking humanly: based on the cognitive modeling approach.
• Thinking rationally: based on logic, as incepted by the Greek philosopher Aristotle (384-322 B.C.).
• Acting rationally: the rational agent approach. An agent can change its state by performing a set of pre-defined actions; for example, a navigating robot has sensors and can perform actions such as "turn left" and "go ahead".
In fact, there are many problems associated with the achievability of these tasks, mainly related to how to render intelligence in a form that could be useful in applications. Many factors influence the usefulness of intelligence in applications, such as background knowledge, the intractability of computations, and the structural limitations of knowledge representation. The problems related to the achievability of these tasks are of a diversified nature and arose from the design community as well as from the philosophical community.
In fact, many philosophers observed a clear rationalist view in artificial intelligence. Dreyfus, however, believed in the anti-rationalist paradigm. The rational approach is considered the foundation stone of classical AI, and Dreyfus analyzed it as a combination of three fundamental hypotheses:
• Psychological hypothesis
• Epistemological hypothesis
• Ontological hypothesis
Figure 1. The hierarchy of intelligence, mind, working senses, and dull matter, with consciousness connecting mind and body.
Figure 2. Connections among artificial intelligence and the cognitive sciences (philosophy, psychology, linguistics, anthropology, and neuroscience), based on Gardner, 1985, p. 37 [19].
Adaptive histogram equalization (AHE) is used as an enhancement capability for soft tissue in [40]; in this method, owing to the high resolution, the images were cropped before being processed with adaptive histogram equalization. In [41], diffraction-enhanced imaging is presented and its capability is observed for different types of tissue; this synchrotron-based technique generates high spatial resolution and contrast for calcified and soft tissues. A method of medical image enhancement based on a non-linear technique and logarithmic transform coefficient histogram equalization is proposed in [42]; it has been observed to improve the visual quality of images, and experimental results show its superiority over other commonly used enhancement techniques. A gray level co-occurrence matrix based image processing method is introduced for face recognition in [43]. Furthermore, a shape "break-and-repair" strategy is presented for medical image segmentation and applied to the segmentation of human lungs and pulmonary nodules [44]; in this strategy, the regions that may cause problems in segmentation are removed and then estimated using implicit surface fitting based on radial basis functions. Moreover, a novel predictive model called the active volume model is proposed in [45] for object boundary extraction. In [46], the use of software with edge detection and segmentation methods is described; this method produces the edge pattern and segments of the brain and brain tumor. Medical imaging remains expensive and highly sophisticated because of proprietary software and the expert personnel required. In [47], the tasks of image analysis have been reviewed. In [48], it has been reported that digital images of melanoma skin lesions can be investigated in order to identify skin cancer at an early stage without performing unnecessary skin biopsies; feature extraction plays a vital role in image analysis in this approach. Moreover, a method for effectively detecting the presence of cancer cells is described in [49]; this method also reduces the overall time taken for diagnosis by carrying out the whole process under biotelemetry. In addition, the work in [50] describes an application of digital image processing and analysis that is substantially beneficial to the healthcare domain for predicting some major human diseases; in the image processing system given in [50], images of the human palm are given as input to the system, which is why this method is said to work on medical palmistry.
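As a concrete illustration of the histogram-equalization-based enhancement cited above, the following Python/OpenCV sketch applies contrast-limited adaptive histogram equalization (CLAHE) to a grayscale image. This is a minimal sketch: the file names, clip limit, and tile size are illustrative assumptions and do not reproduce the specific methods of [40] or [42].

import cv2

# Read a grayscale scan ("scan.png" is a placeholder file name).
gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Equalize the histogram over local tiles, with a clip limit to avoid
# over-amplifying noise in nearly uniform regions.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)

cv2.imwrite("scan_enhanced.png", enhanced)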
It is intuitive to ask: can mind be explained in terms of a machine? Can mind be replicated on a machine? Very often, artificial intelligence (AI) researchers consider the mind as software and the brain as hardware; they also consider the mind as a pattern and the brain as a substrate. Further, researchers have tried to establish the analogy of "senses as inputs" and "behavior as outputs", with neurons considered the smallest processing units of the brain and synapses its "circuitry". Indeed, this analogy is based on casual presumption and is not justified by the actual working of the computer. It is important to analyze whether there is any philosophical basis for this analogy, and to consider whether it correctly reflects the present state of AI research. Further, it is pragmatic to ask whether the metaphor of "mind-as-machine" can be treated as a discipline or is merely a dogma. Broadly speaking, a computer is a machine that can perform many different procedures rather than just one or a few.
Humans are able to perceive and understand the world surrounding them, while machine vision aims to duplicate the effect of human vision by electronically perceiving and
understanding an image. It is a difficult task to make a machine able to see, perceive, and understand images of the world. Human beings live in a three-dimensional world, and through evolution the human mind has developed the ability to analyze the world around it with incredible efficiency, accuracy, and precision. However, when machines try to analyze objects in 3D space, the visual sensors usually produce two-dimensional images, and this transformation from 3D space to 2D space causes a substantial loss of information. In some particular cases, devices can produce 3D images, but this is debatable, and analyzing 3D datasets is in any case considerably more complicated than analyzing 2D data.
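To make the loss of information concrete, consider the ideal pinhole projection, which maps a 3D point (X, Y, Z) to the image-plane point (fX/Z, fY/Z). The following minimal Python sketch, in which the focal length and sample points are illustrative assumptions, shows that distinct 3D points can collapse onto the same 2D location, so depth cannot be recovered from a single image:

def project(point3d, focal_length=1.0):
    # Pinhole camera model: divide by depth Z to land on the image plane.
    X, Y, Z = point3d
    return (focal_length * X / Z, focal_length * Y / Z)

# Two distinct 3D points that land on the same 2D pixel.
print(project((1.0, 2.0, 4.0)))    # (0.25, 0.5)
print(project((2.0, 4.0, 8.0)))    # (0.25, 0.5) -- depth ambiguity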
Indeed, the scientific evidence obtained so far favors the view that mind "transcends biology" and weighs against the materialistic approach. The notion that we can create a machine which has a mind to process images is not a new one; it has featured in entertaining and frightening fiction for a long time. In the early to mid-twentieth century, this fiction gave rise to the term "mechanical men", or robots. In the present scenario, the idea of a mechanical mind has paved the way to the new stream of computational artificial intelligence. Indeed, the advancement of the artificial intelligence domain comprises technological as well as philosophical progress. In the twentieth century, the philosophy of mind saw substantial enrichment, and the currently dominant theory has thus become the precursor of the possibility of a machine mind which can process images like the human mind. Different philosophical theories also require a sound understanding of what a machine is. It is the author's candid view that mind, intelligence, consciousness, emotions etc. are all the outcome of bio-chemical reactions of varying intensity taking place inside the brain cells, and it is natural to wonder whether it is possible to obtain from mechanical means the same outcome that we obtain from bio-chemical reactions, even though we do not know exactly what those bio-chemical reactions are. This is what we are really attempting in the advancement of the mechanical mind for the pursuit of image processing. Further, we must impose some necessary conditions on the development of artificial intelligence, and subsequently on the mechanical mind; for example, embodied experience is a must for the development of semantics, which in turn is necessary for having a mind.
There are two substantial routes to the exogenous development of a machine mind for image processing: (1) a conceptual understanding of mind, so that a machine mind can be developed, and (2) a computational framework of the brain. As already stated, the machine in which an embryonic artificial intelligence or mind is created for image processing must have the capacity for embodied experience, motivated by the fact that biological beings also possess this capacity. Therefore, the machine in which a mind is to be inculcated must possess sensory components to establish relations between itself and the external world, so that the machine can receive images of the world as input. The development of a machine mind for image processing similar to the human mind is still at a primitive stage, and much philosophical and technical advancement is required in this pursuit.
Indeed, the human mind is surprisingly effective at performing sequences of operations such as image capture, early processing, segmentation, model fitting, motion prediction, and qualitative and quantitative inference. For a machine, however, it is highly complicated to perform these tasks effectively; each of them requires a number of efficient algorithms and intelligence. Moreover, there are two pertinent issues in the realm of image processing: (a) the closed world assumption and (b) the frame problem. The closed world assumption says that the world model contains everything the machine needs to know. If the closed world assumption is violated, the
machine may not perform the image processing task correctly. On the other hand, it is easy to forget to put all the necessary details into the world model; as a result, the success of the machine depends on how well the human programmer can think of everything.
Further, the manner of processing and distribution of sensory data is an important phenomenon in image processing. It is natural to ask how much a person, machine or animal is influenced by what it senses. In some paradigms, the use of sensor information is bounded in a specific or dedicated way; in such situations the image processing is local to each machine function. In other paradigms it is assumed that all sensory information will first be processed into one global world model, and subsets of the model will subsequently be distributed to other machine functions as needed. But it is highly difficult for the programmer to capture all the details in the program, because the resulting world model is likely to be huge. The opposite of the closed world paradigm is known as the open world paradigm, in which it is accepted that the closed world hypothesis cannot be applied to the particular domain. No doubt machines are good at tedious algorithms, but the formal representation of the world, and maintaining every change to it, is really hard; this renders image processing by machines difficult. Furthermore, it is also worth considering how difficult it would be to modify a planning algorithm if the world model could suddenly change: the algorithm could get lost between recursions. In addition, the facts and axioms which frame the world very often become too numerous for a realistic domain and thus make image processing difficult for the machine. In the following sub-section 4.2, an attempt is made to analyze the intuitive question: could a machine have a mind?
Indeed, the question of whether a machine could ever think, experience emotions, or have a mind is not really a new one; these questions touch upon deep issues of philosophy. In this sub-section an attempt is made to give new impetus to these aspects.
There are mainly two approaches pertinent to the philosophy of mind, as mentioned in Section 1: dualism and functionalism, substantially attributed to Rene Descartes and B.F. Skinner respectively. Within the realm of dualism, mind and intelligence are considered non-physical and spiritual, which casts doubt on any attempt to understand or imitate the human mind purely by physical means. In contrast, the philosophy of functionalism renders the mind analogous to a computing machine: functionalism asserts that the mind is an information processing machine, like a computer, which receives inputs from the sensory organs as well as from its existing mental state, processes all this information, and produces new behavior and mental states. Two important concepts distinguish dualism and functionalism: the ontological and the epistemic. Ontological concepts describe the nature of mind, and epistemic concepts describe how the mind can be known.
The Turing test has had a substantial impact on cognitive science, and Turing's work seems to fall within the purview of behaviorism. In [54], Turing's proposal is challenged, and it is concluded that a program in execution, i.e. a process, could not understand the conversation. Two important limitations are worth considering: (1) is it possible to know with certainty that any other being is thinking? (2) Is conversation the only paramount way to assess a person's intelligence?
In fact, from Turing's perspective the question "Can machines think?" does not deserve discussion; he asserted that passing the Turing test constitutes thinking, and perhaps it is a yardstick for measuring the substantiality of AI research from the experimental to the philosophical level. However, it is the author's candid view that it is not meaningless to discuss "Can machines think?"; sometimes the author feels that the belief network is a good attempt to make machines think. It is paramount, though, to dwell upon what it means to think or to feel. What is a mind? Does mind really exist? Perhaps philosophy is indispensable for the study of mind; although philosophy does not provide the foundations for the core study of mind, its contribution cannot be underestimated.
Moreover, knowledge about the mind can be obtained by means of introspection. Contrary to this, Skinner's approach considers mind as an entity to be studied through behavior. Undoubtedly, some AI researchers do not like to use a computer metaphor to describe the mind, and would prefer to say that such a description does not suffice to understand the mind. However, the term "mentality", which the author would prefer to replace with the term "mental process", is chiefly understandable in computational terms; perhaps mental processes display behavior, which relies upon observation and experimentation.
Indeed, physical means cannot create a tangible reflection of mind and intelligence, because these are subtle entities. However, philosophically it is possible for a machine to have a mind. The concept of a machine mind seems to be based on the dichotomy between dualism and functionalism, and neither dualism nor functionalism can explain the true nature of mind; an alternative approach is required that considers behavior as evidence for the existence of non-reductive, semiotic modes of mentality. This approach is usually called the "computational conception of mind". Further, in the realm of computation a program is independent of the hardware, but it is also a fact that it depends on some physical representation in order to execute. Therefore, the comparison of mental processes with computer processes leads to the question: is there any entity which, when in execution, produces a "mental process" in the way that a program in execution produces a computer process? In this approach "mentality" is treated as a "computer process", i.e. a program in execution, and thus humans and machines are supposed to operate in similar ways. Since the inception of AI, the replication of 'mentality' in machines has been an inquisitive domain, though certain pitfalls have always been associated with it. Perhaps, to conclude whether a machine could have a mind, one should consider the following issues.
The experiment described in [54] tried to differentiate between the syntactic and semantic properties of symbols. It attempted to extract the syntactic properties of the symbols and consequently concluded that a program cannot impart semantics to the symbols it manipulates and
cannot perceive symbols; for example, learning a language is the process of establishing connections between the syntactic and semantic properties of the symbols used in that language. In arguing for the non-mechanical nature of mind, its highest-level components, such as consciousness, emotion, and intelligence, seem the most pragmatic; this type of argument is given in [55].
Philosophers and theologians have long speculated about the nature of the human mind using functionalism [56]. This theory produces an essence of mental states. Since functions can also be implemented on machines, functionalism states that mental states are similar to the software states of a machine. Therefore, it can be asserted that a machine could have mental states and so seem to have a mind, beliefs, hopes etc.; it can further be asserted that a machine could have intelligence.
There are two important theories describing the nature of mind: one given by Joseph Weizenbaum [52] and another by Rodney Brooks [57]. In [55], the higher levels of mind are used to describe its nature, whereas in [33] the lowest levels of mind are described to assert its mechanical nature. In [57], it is asserted that mind is also a machine and thus must be described in computational terms, just as the brain is supposed to be. Indeed, both the theories given in [55] and [57] fail to acknowledge that the mind may simultaneously be both like and unlike a machine, depending on the level at which it is described. In [58], it is argued that human intelligence and expertise depend basically on unconscious instincts rather than on conscious symbolic manipulation. It is obvious that some phenomena, such as meditation, cannot be described in computational terms. Many robots mimic humans and animals; however, humans and animals have the feeling or experience of making unconscious decisions, and it is natural to ask whether a robot can make an unconscious decision. It is also important to consider whether the functioning of mind depends upon its physical structure and, if the answer is affirmative, how they are associated.
It is also possible that mind might be able to function independently of such structures. We also have to ponder whether it is necessary for the relevant structures to be biological in nature (brains), associated with bio-chemical reactions, or whether mind might be equally well associated with pieces of electronic equipment. Indeed, the quality of human judgment and understanding emerges from consciousness.
It is intuitive to ask whether there is any "machine soul" inside the machine which can create "machine consciousness" in it. The author is sometimes tempted to envision the operating system as the "machine soul" and the various system programs as different aspects of "machine consciousness". The author does not know whether it is genuine to establish this analogy, and would like to leave it to the reader to decide whether it seems comical or serious.
An extensive literature survey shows that research into the philosophical aspects of machine consciousness has been an active area, yet only limited work has been done in this pursuit by various researchers [59, 60], with some further substantial work in this direction [61-69]. Moreover, even if we assume that the mind is similar to a program, it would not be possible to conclude that what is inside the mind is irrelevant [70-74]. Perhaps the limited work in this domain is due to the complex and provocative nature of the related terms such as thought, awareness, and attention.
Can machines ever be intelligent? Today, computers are used in many domains, even to improve human intelligence. Researchers are constantly toiling to embed the ability and flexibility of human intelligence in machines, though this still requires a great deal of effort. Undoubtedly, AI has made substantial efforts in this pursuit, but it remains to be seen whether this rapidly evolving technology can challenge the human intellect; as of today, the most intelligent computer in the world is perhaps as smart as an ant. In the 1960s, researchers obtained substantial success in the domain of AI and predicted that truly intelligent machines would follow in the near future. In [75], Herbert Simon is quoted: "Machines will be capable, within twenty years, of doing any work a man can do". However, true machine intelligence is still not a reality, and perhaps it might only materialize in the distant future.
An extensive literature survey reveals that by the early 1950s the primitive question 'is it possible for a machine to think?' had been replaced by the question 'is it possible for a machine to deal with physical symbols?'. Further, it is natural to ask whether a machine can think by performing structure-sensitive rules. It seems logical that this question represents an improvement, because formal logic and computational theory have seen major developments in the past five decades. Many researchers give an affirmative answer to the question of whether a machine could think. Perhaps the first reason behind this affirmation is related to the computability of functions: if a function is effectively computable, then it is also recursively computable. The author would like to elaborate that here the term "effectively computable" implies a procedure which determines the output for a given input within a finite time interval. The second reason is the important result given by Alan M. Turing: a recursively computable function can be computed in finite time by a Turing machine, such a Turing machine also being called a "Universal Turing Machine". Moreover, the Turing machine is a symbol-manipulating machine. Precisely, these two results imply that a symbol-manipulating machine may be considered a consciously intelligent machine provided it passes the Turing test; they also assure us that a suitable symbol-manipulating machine could compute any effectively computable function. Thus, the challenging task before researchers is two-fold: first, to identify the complex function that is supposed to control the human pattern of response to the environment, and second, to write the program by which a symbol-manipulating machine can compute it. These goals are fundamental in the domain of classical AI and create its fundamental research program.
In fact, the two profound questions in the realm of AI, 'Can a machine perform intelligently?' and 'Can a computer think and act like a human being or other biologically intelligent agent?', depend on how we envisage ourselves and the machines considered. It is highly complicated, and invites skepticism, to believe that a machine, a man-made physical object, could possess intelligence like a biological agent, which has something mysterious and special about it. Classical AI and cognitive science do claim that cognition is computable; therefore, from their perspective, cognition can be reproduced by a machine, and perhaps a machine could even surpass the abilities of human intelligence. However, this is hard to believe, because the functioning of machines depends strictly on logic: they can do
only the things for which they are programmed. Therefore, it seems paradoxical that a man-made machine could reach the ability of its creator.
It is evident that AI algorithms consistently find solutions to given problems by applying heuristic approaches in which the form of reasoning is not strictly logical. Researchers have invented innovative designs for problems which had never been visualized by the system programmer [76]. Computers have learnt to play chess, and perhaps a computer can play a provably perfect game [77]. However, can we impart a sixth sense to a machine? The sixth sense, or subtle perception, is the ability of biological beings to perceive the subtle dimension or the unseen world; it also enables us to understand the subtle cause-and-effect relationships behind many events. This ability is beyond the understanding of the intellect, and it therefore appears that there is a need to re-establish the philosophy and theory of AI and cognitive science.
Brains are poorly designed for calculation and information storage, whereas machines have no such limitation: there is a plethora of algorithms for different types of problems which can readily be used in machines. For some particular tasks, the human brain, which may also be termed a 'meat machine', is certainly not as efficient as machines. The author would like to assert that the construction of an intelligent machine which replicates the brain's design might become a reality. Perhaps the brain's algorithm is close to optimal for some things, but it is certainly not optimal for all problems, and it is plausible to perceive the possibility of improvement over the brain in terms of algorithm, hardware and scale. At the same time, the brain is a highly complex system and remains beyond comparison with any man-made object in many ways. Indeed, we can perceive such objections as a particular stream of philosophy known as "vitalism", i.e. the supposition that some supreme power is at work whenever science cannot explain an unusual phenomenon.
The author wishes to point out that if the brain is essentially just a machine, which appears to be the case, then it is certainly not the most intelligent machine that could exist. This idea is reasonable once we think about it: machines can easily carry more, fly higher, move faster and see further than even the most able animals in each of these categories, so why would human intelligence be any different? Yet although machines can perform many tasks with great precision and effectiveness, a machine cannot be treated as more intelligent than humans, because it is highly complex to develop micro-level phenomena such as mood, mentality, and emotions in machines.
The composition of poems by an AI computer is now a common phenomenon. This fact has created different orientations among researchers: some insist that the machine possesses intelligence, while another group holds a different opinion, arguing that the machine has accomplished this task in a mechanical manner without understanding the real essence of the poem. In fact, the execution of a program based on a successful algorithm does not imply that understanding is taking place; therefore, merely composing a poem does not imply that a machine has intelligence.
Indeed, philosophy reveals that the intelligence of human beings cannot be generated through a mechanical process; rather, its impetus is consciousness. Further, within the purview of Indian philosophy, the brain is treated as non-intelligent, and consciousness uses the brain as a computing instrument.
For decades, researchers believed that a computer could be intelligent provided it was powerful enough. However, the author personally does not agree with this view.
Indeed, brains and computers work in substantially different manners. It is true that the brain constitutes a network of neurons and can perform a variety of tasks efficiently, and it also requires a huge quantity of memory to model the world; with the help of this memory-based model, the brain is able to predict future events with great accuracy. Undoubtedly, future prediction is an important attribute of intelligence. It is also apparent that simple neural networks can give rise to intelligence more easily and successfully than conventional computer programs.
It is the author's view that a machine can act intelligently, but the author doubts that a machine perceives intelligently and philosophically; from this philosophical kaleidoscope, the machine can be asserted to be non-intelligent. Indeed, intelligence is a component of the subtle material entity. Further, the material entity that composes this world is highly complex and infinite, and it is impossible to understand all of it with a finite intellect. Perhaps this is the reason we try to classify and categorize this material entity into smaller granules, to make it lucid, comprehensible and easy to absorb into our finite intellect. Therefore, it is vacuous to attempt to make a machine as intelligent as the human brain, because the human brain itself is not absolutely intelligent; perhaps this is why we have not been able to develop any machine possessing an intellect equivalent to a human being's.
The author would also like to add that, philosophically, there is an amazing variety of ways in which one may approach true consciousness and intelligence; however, it is impossible for any effort of logical reason to perceive the true scenario of ultimate reality. The philosophical perspective of metaphysics entails that no manifestation is to be taken as absolutely true, while from the standpoint of experience every one of them has some validity. Therefore, it is the author's view that whatever we are trying to inculcate in an agent is not "intelligence" in the true sense. Indeed, pure and transcendent intelligence is different from scientific intelligence, though it is not discontinuous from it. Further, every science expresses itself in its own fashion and within a certain order of things, and thus reflects the higher immutable truth in which everything of any reality must be incorporated. Scientific or discriminative intelligence prepares us for the higher wisdom; however, when scientific intelligence is inculcated in a machine, it displays very limited and confined behavior. The miscellaneous limitations of AI are discussed philosophically in the forthcoming section.
For decades, AI has been trying to enrich machines with the ability to understand observations. Although progress in this pursuit has been substantial, the practical ability of machines to understand observations is still very limited. In fact, researchers are trying to solve the related multidisciplinary scientific problems; systems capable of solving such problems are termed "cognitive systems", and this characteristic is paramount for the development of intelligent machines. The concept of interpretation is considerably advantageous in the realm of image processing: it paves the way for the application of the syntax and semantics of propositional and first-order predicate logic. In image processing, image observations are often considered as instances of formal expressions, and semantics reveals the relationship between expressions and their meaning. It is evident that the interpretation of images requires a high degree of AI, and interpretations can be treated as instances of semantics.
A second reason is noise. Noise is an unavoidable fact in any measurement of the real world, and machines are not properly equipped with AI to deal with this shortcoming. Noise introduces a degree of uncertainty into the image and thus demands mathematical tools such as fuzzy logic and probability theory to deal with the uncertainty involved; the presence of noise also renders the analysis of images stochastic in nature. Moreover, the application of complex mathematical tools makes the entire process of image analysis more involved than a deterministic analysis, and the accuracy of a stochastic analysis will generally be lower. In addition, the application of complex mathematical tools for noise removal may itself require a high degree of AI and machine learning. Thirdly, in image processing the machine has to handle the huge quantity of data associated with images and video, which requires a considerable degree of machine perception, machine cognition, and AI; the real-time processing of such data is a challenging task and requires suitably designed algorithms besides intelligence, perception, cognition, and hardware support. Fourthly, brightness measurement is an important task in image analysis. The brightness and intensity of an image depend on many factors, such as the type and position of the light source and the position of the observer relative to the source; image brightness also depends on the surface geometry of the object and the reflective properties of the object's surface. Brightness measurement is therefore also stochastic in nature and requires a machine with substantial intelligence. Further, as of today the AI of machines falls well short of human intelligence, which in turn renders image analysis difficult. Moreover, in image processing the phenomenon of image understanding can be envisaged as an attempt to find a relation between input image(s) and previously established models of the observed world, somewhat similar to the working of an associative memory neural network. In the transition from the input image data to the previously established model of the observed world, a substantial quantity of relevant information is lost. In fact, the process of image understanding is performed in many steps, and several levels of image representation are used: the lowermost layers contain raw image data, whereas the uppermost levels attempt to interpret the data. The artificial intelligence of the machine plays a vital role in these intermediate representations of the image.
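As a minimal illustration of the noise issue, the following Python sketch adds synthetic Gaussian noise to an image and suppresses it with simple Gaussian smoothing. The file names, noise level, and kernel size are assumptions chosen for the example; real systems would typically rely on the probabilistic or fuzzy tools mentioned above.

import cv2
import numpy as np

# Load a grayscale frame ("frame.png" is a placeholder file name).
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Simulate sensor noise, then clip back to the valid intensity range.
noisy = img + np.random.normal(0, 15, img.shape)
noisy = np.clip(noisy, 0, 255).astype(np.uint8)

# Suppress the noise with a small Gaussian smoothing kernel.
denoised = cv2.GaussianBlur(noisy, (5, 5), 1.5)
cv2.imwrite("frame_denoised.png", denoised)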
Figure 3. Four levels of image representation: an object or scene (level 1), the 2D digital image (level 2), the image with extracted features (level 3), and image understanding at the object level (level 4); the step from feature extraction to objects is a substantial domain of AI application.
Machine vision constructs these intermediate representations of images, and it is the
crucial role of AI to establish and maintain relations between entities within and
between the layers.
Data organization is an important task in machine intelligence and is of paramount
importance in the classification of image representations. Four possible levels of image
representation, together with the domain where artificial intelligence is used most
substantially, are shown in Figure 3. Figure 3 depicts a bottom-up approach to
information processing: information flows from the low level, i.e. from signals with
zero or minimal abstraction, to high-level abstraction. This transition from low-level to
high-level abstraction requires a substantial degree of AI. It is also pertinent to mention
that the flow of information is not necessarily unidirectional; very often a feedback loop
is introduced, which creates the need for a suitably designed machine learning
algorithm. Further, image processing can be broadly classified into two categories:
low-level image processing and high-level image processing.
Low-level image processing generally requires little knowledge and intelligence.
Typical tasks performed at this level are image compression, pre-processing of images,
edge extraction, image sharpening, etc. Low-level image processing is mainly
associated with the input images; perhaps its first step is digitization of the image, and
the machine must therefore possess a suitably designed algorithm for this purpose. In
contrast, high-level image processing requires a high degree of AI in the machine. It is
based on knowledge, goals, and plans of how to achieve those goals, and AI methods
are widely used for it. Moreover, high-level image processing also requires artificial
cognition that mimics the functionality of human cognition. Indeed, high-level image
processing starts with some formal model of the world, and the "reality" perceived in
the form of digitized images is then matched to that model [78]. Furthermore, machine
vision is expected to solve very complex tasks, and it is therefore essential that the
machine possess sufficient AI capability.
Image processing can hardly be accomplished without image segmentation, a core
step in which the machine tries to separate objects from the image background and from
each other. Image segmentation can be total or partial, as required; total segmentation is
possible only for very simple tasks, while partial segmentation is where AI plays a vital
role. Low-level and high-level image processing also differ in the data they use.
Low-level image processing uses the data associated with the original image, whereas
in high-level image processing the data still originates in the image, but only those data
relevant to the high-level goals are extracted, reducing the data quantity considerably.
It is therefore obvious that the machine must possess the various characteristics of AI
and cognition for this purpose. The ordering of low-level steps to solve a specific task
is a complicated and open problem; its automation has not yet been achieved, and even
today this task is completed by a human operator who supplies the domain-specific
knowledge. Moreover, the uncertainty involved
depends on the operator's intuition and previous experience [78]. Furthermore, image
understanding is of paramount importance in high-level image processing, and the role
of AI in this pursuit is formidable; the task is very complicated and computationally
intensive.
It is obvious that the role of AI is of paramount importance for the task of image
processing, and whatever restrictions apply to AI will also appear as limitations of
image processing. Not everything that can be imagined and hypothesized can provably
be realized in AI, and this holds for image processing as well. Indeed, the functioning
of AI, and thus of image processing, is always associated with some inevitable and
inherent limitations. One such limitation is given in [79]. A mathematical explanation
of these limitations is hard, but philosophically they are easily explained. Any scientific
theory commences from a set of assumptions, called axioms, which are taken as
self-evident truths, and the subsequent proof of any theorem relies upon these axioms.
Gödel's second incompleteness theorem, given in [79], says:
"If an axiom system is consistent, then it cannot verify its own consistency".
The influence of this theorem is severe and it has adverse implications for AI. It
implies that if we design a machine for image processing, that machine cannot justify
its own consistency, because the machine has been designed and developed from a set
of scientific axioms. Further, according to this view, the human mind is not merely a
system working on a set of axioms; its functioning appears to go beyond them.
Therefore, it follows that a machine cannot fully emulate the human mind in image
processing.
7. Conclusions
Mind is closely associated with the learning process and can perform numerous tasks,
including image processing. Learning processes are also built into machines so that they
can perform image processing tasks. The learning process for image processing requires
establishing connections between syntactic and semantic properties so that feature
extraction and image understanding can be done precisely and efficiently. In fact, the
learning process for image processing in a machine mimics the mental process by which
information is processed in the human mind for the same purpose. The philosophical
questions 'can a machine be intelligent?' and 'can a machine think and act like a human
being or another biologically intelligent agent?' have been discussed here in the light of
images and their processing. It is highly doubtful that a machine can process images as
efficiently as the human mind; after all, a machine is a man-made physical object.
Gödel's first and second incompleteness theorems, which deal with consistency, also
support this view. Whether a machine could have a mind, or whether a machine
possesses intelligence, thus remains an open and skeptical discussion. However, it can
be asserted that a machine could be as intelligent as a human being if it achieved the
attributes of self-learning and self-improvement, which are essential attributes of
intelligence; such a machine could then be expected to perform image processing tasks
in a manner similar to human beings. The trail of advancement observed in science and
technology suggests that machines performing image processing as excellently as the
human mind may become a reality in the near future.
References
[1] Feigl, H., The 'Mental' and the 'Physical', University of Minnesota Press, 1958, reprinted 1967.
[2] David J. Chalmers, Philosophy of Mind: Classical and contemporary readings, Oxford University
press, 2002.
[3] Ryle, G., The concept of mind, Hutchinson and Co., 1949.
[4] Armstrong, D. M, A Materialist Theory of the Mind, Routledge and Kegan Paul, 1968.
[5] Fodor, J., Special sciences: Still autonomous after all these years, Philosophical Perspectives 11(1997),
149-63.
[6] Kim. J., Mind in a Physical World, MIT Press, 1998.
[7] McCarthy, John, and Hayes, Patrick J., Some Philosophical Problems from the Standpoint of Artificial
Intelligence. In: Meltzer, B. and Michie, D., Machine Intelligence 4, Edinburgh University Press, pp.
463-502, 1969.
[8] Winston P.H., Artificial Intelligence, Addison-Wesley, 1992.
[9] Rich E., and Knight K., Artificial Intelligence, McGraw Hill Text, 1991.
[10] Penrose Roger, The Emperor's New Mind: Concerning Computers, Minds, and the Laws of Physics, Oxford
University Press, Oxford, 1989.
[11] Warren S. McCulloch, Walter Pitts, A logical calculus of the ideas immanent in nervous activity,
Bulletin of Mathematical Biology, Vol. 52, No. 1/2, pp. 99-115, 1990, Pergamon Press plc, Great Britain.
Reprinted from the Bulletin of Mathematical Biophysics, Vol. 5 (1943), 115-133.
[12] Norbert Wiener, Cybernetics or control and communication in the animal and the machine, second
edition, MIT Press, Cambridge, Massachusetts, 1985. Reprinted from 1948.
[13] C.E. Shannon, A Mathematical Theory of Communication, The bell system technical journal, Vol. 27,
(1948), 379-423, 623-656, July, October, Copyright 1948 by American telephone and telegraph co.,
USA.
[14] Newell A., Shaw J., and Simon H. A., Empirical explorations of the logic theory machine: A case study
in heuristics, Proceedings of the 1957 western joint computer conference February 26-28, pp. 218-230,
1957.
[15] Newell A., Shaw J.C., Programming the logic theory machine, Proceedings of the 1957 Western Joint
Computer Conference, February 26-28, pp. 230-240, 1957.
[16] Rosenblatt F., The perceptron: A probabilistic model for information storage and organization in the
brain, Psychological Review, 65 (1958), 386-407. (Reprinted in Neurocomputing, MIT Press, 1988.)
[17] John McCarthy, Recursive functions of symbolic expressions and their computations by machine,
MIT, 1960. http://edge.cs.drexel.edu/regli/Classes/Lisp_papers/McCarthy-original-LISP-paper-
recursive.pdf
[18] Herbert A. Simon, Allen Newell, Human problem solving, American psychologist, prentice Hall, 1972.
[19] Pamela McCorduck, Machines who think, A personal inquiry into the history and prospect of artificial
intelligence, San Francisco, California, Freeman, pp. 375, 1979.
[20] Gardner, H., The mind’s new science, New York: Basic Books, 1985.
[21] Russell S., and Norvig P., Artificial Intelligence: A Modern Approach, Prentice Hall, Second edition,
2002.
[22] Turing, A.M., Computing machinery and intelligence, Mind, 59 (1950), 433-460.
[23] Brey, Philip, Hubert Dreyfus, Humans versus computers. To appear in: Achterhuis, H. (ed.), American
philosophy of technology: The empirical turn , Indiana University Press, 2001.
[24] Azriel Rosenfeld, Picture Processing by Computer, New York: Academic Press, 1969.
[25] Suezou Nakadate, Toyohiko Yatagai, and Hiroyoshi Saito, Electronic speckle pattern interferometry
using digital image processing techniques, Applied Optics, 19, Issue 11 (1980), 1879-1883.
[26] William H. Carter, Evaluation of Peak Location Algorithms With Subpixel Accuracy For
Mosaic Focal Planes, Processing of Images and Data from Optical Sensors, Conference Volume 0292,
San Diego, 1981.
[27] David W. Robinson, Automatic fringe analysis with a computer image-processing system, Applied
Optics, 22, Issue 14 (1983), 2169-2176.
[28] S. V. Ahamed, V. B. Lawrence, An image processing system for eye statistics from eye diagrams, IAPR
Workshop on CV – Special Hardware and Industrial Applications, October 12-14, 1988, Tokyo.
[29] P. K. Sahoo, S. Soltani and A. K. C. Wong, A Survey of Thresholding Techniques, Computer Vision,
Graphics, and Image Processing, 41 (1988), 233-260.
[30] Marc Antonini, Michel Barlaud, Image Coding Using Wavelet Transform, IEEE Transactions on Image
Processing, vol. 1, no. 2, April 1992.
[31] Patnaik, S., Pal, R.N., Image compression using auto-associative neural network and embedded zero-tree
coding, IEEE Third Workshop on Wireless Communications (2001), 388-390.
[32] Shanhui Sun, Christian Bauer and Reinhard Beichel, Robust Active Shape Model Based Lung
Segmentation in CT Scans, LOLA11 Challenge, pp. 213-223, 2011.
[33] Sonal, Dinesh Kumar, A study of various image compression techniques.
www.rimtengg.com/coit2007/proceedings/pdfs/43.pdf
[34] Suzuki K, Abe H, MacMahon H, Doi K, Image-processing technique for suppressing ribs in chest
radiographs by means of massive training artificial neural network (MTANN), IEEE Transactions on
Medical Imaging., vol. 25,no.4, (2006), 406-416.
[35] Weixing Wang, Shuguang Wu, A Study on Lung Cancer Detection by Image Processing, International
Conference on Communications, Circuits and Systems Proceedings, (2006), 371-374.
[36] Md. Foisal Hossain, Mohammad Reza Alsharif, Image Enhancement Based on Logarithmic Transform
Coefficient and Adaptive Histogram Equalization, International Conference on Convergence
Information Technology, 21-23 November, (2007), 1439-1444.
[37] Wenhong Li, Yonggang Li, Kexue Luo, Application of Image Processing Technology in Paper Currency
Classification System, IEEE, 22-24 Oct. (2008), 1-5.
[38] Li Minxia, Zheng Meng, A Study of Automatic Defects Extraction of X-ray Weld Image Based on
Computed Radiography System, International Conference on Measuring Technology and
Mechatronics Automation (ICMTMA), 2011.
[39] Kanwal, N., Girdhar, A., Gupta, S., "Region Based Adaptive Contrast Enhancement of Medical X-Ray
Images", Bioinformatics and Biomedical Engineering (ICBBE), 5th International Conference, (2011),
1-5.
[40] Noorhayati Mohamed Noor, Noor Elaiza Abdul Khalid, Fish Bone Impaction Using Adaptive
Histogram Equalization (AHE), Proceedings of the Second International Conference on Computer
Research and Development , IEEE Computer society Washington, (2010),163-167.
[41] Lu Zhang, Dongyue Li, Shuqian Luo, "Information extraction of bone fracture images based on
diffraction enhanced imaging", International Conference of Medical Image Analysis and Clinical
Application (MIACA), 10-17 June 2010, pp. 106-108.
[42] Md. Foisal Hossain, Mohammad Reza Alsharif, and Katsumi Yamashita, "Medical Image Enhancement
Based on Nonlinear Technique and Logarithmic Transform Coefficient Histogram Matching",
IEEE/ICME International Conference on Complex Medical Engineering, July 2010, pp. 13-15.
[43] Hasan Demirel, Co-occurrence matrix and its statistical features as a new approach for face recognition,
Turkish Journal of Electrical Engineering and Computer Sciences, vol. 19, No. 1 (2011), 97-107.
[44] Pu J, Paik DS, Meng X, Roos JE, Rubin GD, "Shape Break-and-Repair" Strategy and Its Application to
Automated Medical Image Segmentation, IEEE Transactions on Visualization and Computer Graphics,
vol. 17, no. 1, January 2011.
[45] Tian Shen, Hongsheng Li, Xiaolei Huang, Active Volume Models for Medical Image Segmentation,
IEEE Transactions on Medical Imaging, vol. 30, Issue 3 (2011), 774-791.
[47] Pallavi T. Suradkar, Detection of Malarial Parasite in Blood Using Image Processing, International
Journal of Engineering and Innovative Technology (IJEIT), vol. 2, Issue 10, (2013), 124-126.
[48] Md. Amran Hossen Bhuiyan, Ibrahim Azad, Md. Kamal Uddi, Image Processing for Skin Cancer Features
Extraction, International Journal of Scientific & Engineering Research, Volume 4, Issue 2 (2013), 1-6.
[49] S. Kannadhasan, N. Basheer Ahamed, M. Rajesh Baba, Cancer Diagnosis with the help of digital Image
Processing using ZIGBEE Technology, International Journal of Emerging Trends in Electrical and
Electronics (IJETEE), Volume 1, Issue 2, 2013.
[50] Hardik Pandit, Dr. D M Shah, Application of Digital Image Processing and Analysis in Healthcare
Based on Medical Palmistry, International Conference on Intelligent Systems and Data Processing
(ICISD), Special Issue published by International Journal of Computer Applications (IJCA), (2011),
56-59.
[51] Newell, A. and Simon, H. A., Computer Science as Empirical Inquiry: Symbols and Search,
Communications of the ACM 19 (1976), 113–126.
[52] Putnam, H., Minds and machines, in S. Hook (ed.), Dimensions of Mind, New York University Press,
pp. 138–164, 1960.
[53] Haugeland, J, AI: The Very Idea, MIT Press, 1985a.
[54] Searle John. R., Minds, brains, and programs, Behavioral and brain sciences, 3(3) (1980), 417-457.
[55] Joseph Weizenbaum, Computer Power and Human Reason: From Judgment to Calculation, W. H.
Freeman, 1976.
[56] David L. Anderson, Introduction to functionalism, Consortium on cognitive science instruction, 2003.
http://www.mind.ilstu.edu/curriculum/functionalism_intro/functionalism_intro.php
[57] Rodney Brooks, Flesh and Machines: How Robots Will Change Us, Vintage, Reprint edition 2003.
[58] Hubert L. Dreyfus, Body and world: a review of What Computers Still Can't Do: A Critique of Artificial
Reason, MIT Press, Cambridge, MA, 1992.
[59] Anderson, M. L., and Oates, T., A review of recent research in metareasoning and metalearning. AI
Magazine, 28(1) (2007), 7-16.
[60] Baars, B. J., A cognitive theory of consciousness, Cambridge University Press, 1998.
[61] Blackmore, S., There is no stream of consciousness, Journal of Consciousness Studies, 9(5-6) (2002),
17-28.
[62] Chrisley R., Embodied artificial intelligence. [Review]. Artificial Intelligence, 149(1) (2003), 131-150.
[63] Clowes, R., Torrance, S., and Chrisley, R., Machine consciousness-embodiment and imagination.
[Editorial Material]. Journal of Consciousness Studies, 14(7) (2007), 7-14.
[64] Dennett D.C., Consciousness Explained, Penguin Press, 1993.
[65] Densmore, S., and Dennett D., The virtues of virtual machines. [Article]. Philosophy and
Phenomenological Research, 59(3) (1999), 747-761.
[66] Gamez D., Progress in machine consciousness, Consciousness and Cognition, 17(3)(2008), 887-910.
[67] Haikonen P. O. A., Essential issues of conscious machines, Journal of Consciousness Studies,
14(7)(2007), 72-84.
[68] Rolls E. T., A computational neuroscience approach to consciousness. [Review], Neural Networks,
20(9) (2007), 962-982.
[69] Rosenthal D. M., The nature of Mind, Oxford University Press, 1991.
[70] Sloman, A., and Chrisley, R., Virtual machines and consciousness, Journal of Consciousness Studies,
10(4-5) (2003), 133-172.
[71] Sun R., Learning, action and consciousness: A hybrid approach toward modelling consciousness.
[Article], Neural Networks, 10(7)(1997), 1317-1331.
[72] Taylor, J. G., CODAM: A neural network model of consciousness, Neural Networks, 20(9)(2007), 983-
992.
[73] Velmans, M., Making sense of causal interactions between consciousness and brain, Journal of
Consciousness Studies, 9(11)(2011), 69-95.
[74] Velmans, M., How to define consciousness: And how not to define consciousness, Journal of
Consciousness Studies, 16(5) (2009), 139-156.
[75] Crevier D., AI: The tumultuous search for artificial intelligence. New York: Basic Books, 1993.
[76] Koza, J. R., Keane, M. A., Streeter, M. J., Mydlowec, W., Yu, J., and Lanza, G., Genetic programming
IV: Routine human-competitive machine intelligence, Kluwer Academic, 2003.
[77] Schaeffer, J., Burch, N., Björnsson, Y., Kishimoto, A., Müller, M., Lake, R., Lu, P., and Sutphen, S.,
Checkers is solved, Science, 19 July, 2007.
[78] Milan Sonka, Vaclav Hlavac, Roger Boyle, Image Processing, Analysis, and Machine Vision,Cengage
learning, 2008.
[79] Gödel K., Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I,
Monatshefte für Mathematik und Physik, 38 (1931), pp. 173-198. Translated into English by Van
Heijenoort: From Frege to Gödel, Harvard University Press, 1971.
Deep Learning for Image Processing Applications
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-27

Deep Neural Networks for Image Classification
A. Vasuki and S. Govindaraju
Abstract. This chapter gives an insight into Deep Learning Neural Networks and
their application to Image Classification / Pattern Recognition. The principle of
Convolutional Neural Networks will be described and an in-depth study of the
algorithms for image classification will be made. In artificial intelligence,
machine learning plays a key role. The algorithm learns when exposed to new
data or environment. Object / Pattern Recognition is an integral part of machine
learning and image classification is an integral part of such algorithms. The
Human Visual System efficiently classifies known objects and also learns easily
when exposed to new objects. This capability is being developed in Artificial
Neural Networks and there are several types of such networks with increasing
capabilities in solving problems. Neural networks themselves have evolved from
evolutionary computing techniques that try to simulate the behavior of the human
brain in reasoning, recognition and learning. Deep neural networks have powerful
architectures with the capability to learn and there are training algorithms that
make the networks adapt themselves in machine learning. The networks extract
the features from the object and these are used for classification. The chapter
concludes with a brief overview of some of the applications / case studies already
published in the literature.
1. Introduction
Machine Learning (ML) is a part of Artificial Intelligence (AI) that enables computer
algorithms to learn from input / training data [1]. Learning can be categorized as
supervised, unsupervised or semi-supervised. If the learning happens with a set of
known (training) data and the algorithm is trained for a particular task, it is supervised
learning [2]. If the learning happens with unknown input data, it is unsupervised
learning. Semi-supervised learning falls in between these two categories. In
supervised learning, the training inputs and the desired outputs are given and the
relation of input to output is learnt by the algorithm. The mapping between the input
and output are already known. In unsupervised learning, the inputs are given and the
algorithm learns to find patterns or features to produce the output. The number of
outputs need not be known earlier by the algorithm. In semi-supervised learning, the
training inputs and desired outputs are partially given and the algorithm learns to find
the missing relations and patterns.
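To make the distinction concrete, a minimal Python sketch is given below; it assumes the scikit-learn library is available and uses LogisticRegression as a stand-in supervised classifier and KMeans as a stand-in unsupervised method, on toy data invented purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: four 2-D samples and their (known) class labels.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])

# Supervised learning: both the inputs and the desired outputs are given.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.85, 0.75]]))      # expected to predict class 1

# Unsupervised learning: only the inputs are given; the grouping is discovered.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                        # cluster assignments found from the data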
Neural Networks (NN) play a key role in the development of machine learning
algorithms in AI. NNs have been developed, inspired by the working of biological
neurons in the human brain. Each neuron is a processing element and an
interconnection of such neurons leads to huge computing power that can solve complex
tasks. Warren McCulloch and Walter Pitts (1943) created a computational model for
neural networks based on mathematics and developed algorithms called threshold logic.
This model paved the way for neural network research to split into two approaches: one
focused on the biological processes in the brain, the other on the application of neural
networks to artificial intelligence. Neural networks have been developed primarily for solving
complex problems that might not fit in with rule based algorithms or programs. In real
world problems, some data might be missing or all the rules might not be known.
Heuristics play a large role in artificial intelligence and hence in neural networks.
Artificial Neural Networks (ANNs) are computing systems inspired by
the biological neural networks that constitute human and animal brains. An ANN is
constructed from a collection of connected units called artificial neurons. Each
connection between neurons can transmit a signal from one to another. The receiving
neuron can process the signal(s) and then transmit to the downstream neurons connected to
it. Neurons have states, generally represented by real numbers, typically between 0 and
1. Neurons also have weights that vary as learning progresses, and which increase or
decrease the strength of the signal sent downstream to other neurons. Further, a neuron
may have a threshold such that the downstream signal is sent only if the aggregate
signal is below (or above) that threshold level. Typically, neurons are organized in
layers. Different layers perform different kinds of transformations on their inputs.
Signals travel from the first (input) layer, to the last (output) layer, possibly after
traversing the layers in between them multiple times.
Such neural networks learn to do tasks, with progressive improvement in
performance, by considering examples, generally without task-specific programming.
They have found more use in applications difficult to express in a traditional computer
algorithm using rule-based programming. The original goal of the neural network
approach was to solve problems in the same way that a human brain would. Over time,
attention focused on matching to specific mental abilities, leading to deviations from
biology such as back propagation, or passing information in the reverse direction and
adjusting the network to reflect feedback information. Neural networks have been
designed for a variety of tasks, including computer (machine) vision, speech
recognition, pattern / object recognition, image classification, machine
translation, social network filtering, playing video games, medical diagnosis and in
many other domains.
Neurons are organized in layers with interconnections between the layers. Signals
are transferred across the layers, after processing. Mostly, signals travel from the input
layer to the output layer in the forward direction and such networks are called
feedforward networks. In certain networks, signals also travel in the reverse direction,
leading to recurrent networks and the development of the back propagation algorithm.
All layers of neurons other than input and output are called ‘hidden’ and their number
varies depending on the problem to be solved. When the number of hidden layers is
more than two or three, it becomes a deep neural network. Such deep neural networks
have the capacity to learn to solve problems and the learning is called deep learning.
Such deep neural networks and learning algorithms are developed for solving complex
tasks such as machine vision, pattern recognition, object recognition / identification,
surveillance, image classification, character recognition, speech / speaker recognition,
etc.
Image Classification is one of the toughest tasks to be learnt and solved by an
algorithm. When several images are available, the human brain learns and classifies
existing as well as new images with almost 100% accuracy. Machine learning
algorithms have been developed to mimic exactly this behavior of the human brain. The
problem becomes more difficult when the images are taken under different conditions
such as change in illumination, rotation or translation of objects in the image,
incomplete or hidden objects, different postures if the object is a face, etc. Such
conditions lead to hundreds of different images (image sets) containing the same object,
adding to the complexity of the recognition / classification problem.
Many algorithms have been developed in the past for image classification such as
Minimum Distance Classifier, k-means clustering, Bayes Classifier, Maximum
Likelihood Classifier, Support Vector Machines (SVM), Principal Component Analysis
(PCA), Kernels, etc. These algorithms are based on a fixed logic and they cannot learn.
They are either parametric or non-parametric. If the classification algorithm requires
the computation of statistical parameters such as mean, covariance, etc., it is
parametric; otherwise it is non-parametric. Parametric classifiers require training
samples / data from which the parameters have to be estimated. The more training data
available, the more accurate the parameter estimation will be.
Image classification can be pixel-based or object-based. In pixel-based
classification, the attributes of every pixel are extracted to label it as belonging to a
particular class. In object-based classification, segmentation is done to extract regions
or objects in the image and their attributes are evaluated. Features or attributes have to
be extracted in order to do classification. The efficiency of the algorithm depends on
the number of features used in the process, which leads to the problem of the 'curse of
dimensionality'. Dimensionality reduction has to be done, which is equivalent to feature
reduction and hence reduces time consumption. The more features used, the more
processing and storage of data are required, increasing the time complexity of the
algorithm. Algorithms that classify objects with a minimum number of features and
less time consumption are more efficient.
Neural Networks have also been developed for image classification, and they are
endowed with the capability to learn. A set of training data is available, and the
relationships between training inputs and the desired outputs (or pattern classes) are
fed into the network. The network is trained on these known data and learns to
recognize / classify new data. Machine learning algorithms have been developed to
make this learning happen, with the logic being learnt by the algorithm during training.
The data or patterns are given as inputs, the outputs are specified and the algorithm
learns to find the relationship between inputs and outputs. When the problem is
complex such as in image classification, more hidden layers are required and this
makes the neural network ‘deep’. Hundreds of hidden layers improve the accuracy of
the classification and the learning becomes ‘deep learning’.
Deep Learning Neural Networks are covered in Section 2. Image Classification is
discussed in Section 3. Some of the applications of Deep Neural Networks to image
classification problems are briefly described in Section 4. The chapter concludes with
Section 5, followed by the bibliography.
2. Deep Learning Neural Networks
Machine learning algorithms generate their own logic based on the input data. The
algorithm learns by itself, and code need not be written to solve every problem. A typical
example is the categorization of emails into various bins such as inbox, spam, etc. Another
important classification task concerns the objects present in images. For example, an image
contains the picture of an animal, and the problem is to categorize the animal as a deer, dog,
cat, lion, etc., which is equivalent to object / pattern recognition. A large number of
images containing pictures of animals has to be fed as input to the algorithm, from which it
can learn to classify. If the images are already classified when fed, it is supervised
learning; if not, it is unsupervised learning. The simplest case is that of classifying a
pattern into two classes, such as identifying whether the animal in the picture is an
elephant or not; this is a two-class problem. Basically, machine learning
algorithms classify based on patterns in the data. The data can be text, sound, images,
etc.
Neural networks have been found to be best suited for implementation of machine
learning algorithms. Traditional neural networks have one input layer, one output layer
and two or three hidden layers. Deep neural networks have one input layer, one output
layer and hundreds of hidden layers, typically as shown in Figure 1 [3]. The more
hidden layers, the deeper the network. The layers are interconnected, with the output of
the previous layer being the input of the current layer. The inputs / outputs are
weighted, and the weights determine the performance of the network. Training of the
network involves obtaining the appropriate weights for the various layers. Deep
networks require higher processing power, computing speed, a large database, and
appropriate software with parallel processing.
Convolutional Neural Network (CNN) is a type of deep learning network that has
become popular for image classification. Three examples of CNN architectures are
given in Figure 2 [4], Figure 3 [5] and Figure 4 [3]. It consists of an input layer and
hundreds of feature detection layers. Feature detection layers perform one of the
following three operations: Convolution, Pooling, Rectified Linear Unit (ReLU).
Convolution puts the image through convolution filters that activate certain features in
the image. Pooling performs non-linear down sampling in order to reduce the number
of data to be handled. The Rectified Linear Unit retains positive values and maps
negative values to zero. The classification layer is the one before the output layer. It is
a fully connected layer with N-dimensional output, N being the number of classes to be
categorized. This layer outputs an N-dimensional vector; each element of the vector is
the probability that the input image belongs to the corresponding one of the N classes.
The final output layer uses a softmax function to give the classified output. To produce
accurate results, thousands or millions of images have to be fed into the network, and
this requires high computing power with several Graphics Processing Units (GPUs)
operating in parallel.
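For concreteness, the layer sequence described above can be sketched with the Keras API of Tensorflow (discussed later in this section); the 64 x 64 x 3 input size, the filter counts, and the assumption of N = 10 classes are illustrative choices, not values taken from the chapter.

import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal sketch of a CNN: convolution + ReLU, pooling, and an N-way softmax classifier.
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),              # input layer: RGB image (assumed size)
    layers.Conv2D(16, (5, 5), activation='relu'), # convolution followed by ReLU
    layers.MaxPooling2D((2, 2)),                  # pooling: non-linear downsampling
    layers.Conv2D(32, (5, 5), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),       # fully connected layer, N = 10 classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()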
When millions of training images are fed into the deep learning network, the
network starts to identify features of the data that can be used in categorization.
Processing takes place at each layer of the network and the result is fed to the next
consecutive layer. CNNs have been designed based on the biological structure of the
visual cortex. The simple and complex cells of the visual cortex activate based on
subregions of the visual field, called receptive fields. The neurons of a layer in a CNN
connect to subregions of the previous layer instead of being fully connected, and are
not responsive to other subregions. The subregions are allowed to overlap and hence
produce spatially correlated outcomes, unlike traditional neural nets. This is the
fundamental difference between CNNs and other neural nets. The CNN reduces the
number of parameters to be handled by reducing the number of connections, sharing
weights, and downsampling.
The input to the CNN will be an image containing one or more objects to be
classified. The number of input values will depend on the image size and the depth of
the pixels (3 in the case of RGB). These are simply numbers that have to be translated
or identified as objects belonging to a certain class. CNN tries to imitate the human
visual cortex that has sensitivity to certain regions of the visual field in small regions of
cells. Some of the neurons in the brain are sensitive to certain content in the visual
images, for example, edges, curves, etc.
The first layer in a CNN is the convolution layer, which performs spatial convolution of
a pre-defined mask with the pixel values. This is equivalent to a linear filtering operation,
and the convolution output depends on the mask. For example, the image size might be
256 x 256 x 3 and the mask size 5 x 5 x 3; the pixel values and mask values are then
multiplied element by element and summed up to obtain the convolved output at the
position of the mask placed over the image. The mask is moved over the image from left
to right and top to bottom, and the convolution is repeated to cover the entire image. The
shape to be detected determines the mask values, so that higher convolved values are
obtained if the shape is present in the image at that location. This produces a feature map,
which is another array of numbers, as shown in Figure 6 [6].
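A bare-bones version of this sliding-mask computation, ignoring padding, stride, bias and the RGB depth for brevity, might look like the following sketch; the 5 x 5 mask size follows the example above, and the random image and mask values are placeholders.

import numpy as np

def convolve2d(image, mask):
    # Slide the mask over the image; at each position multiply element by element and sum.
    mh, mw = mask.shape
    ih, iw = image.shape
    out = np.zeros((ih - mh + 1, iw - mw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + mh, c:c + mw] * mask)
    return out

image = np.random.rand(256, 256)        # stand-in for one colour plane of a 256 x 256 image
mask = np.random.rand(5, 5)             # stand-in for a learned 5 x 5 convolution mask
feature_map = convolve2d(image, mask)   # the resulting feature map (another array of numbers)
print(feature_map.shape)                # (252, 252)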
When the activation output (feature map) of the first layer is input to the second
hidden layer and the convolution operation is repeated, another feature map that is at a
higher level than the first is obtained. This process produces different activation maps
for different features in the image, including complex features.
Figure 6. Convolution in a CNN mapping from input to hidden layer [Cited from Ref. 6]
There is a fully connected layer at the end of the network that takes its input from
the preceding layer and produces an N-dimensional output. For an N-class problem, it
outputs N different probability values, each representing the probability that the object
belongs to the corresponding class. For the image containing a bird, as shown in Figure 5,
the probability for class bird could be 0.72, for class dog 0.17, for class flower 0.15, etc.
This is the softmax approach, in contrast to the hard approach where the probability of
one of the classes is 1 and all others are 0. The CNN needs to be trained by the back
propagation algorithm on millions of images for the classification to be accurate.
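The softmax mapping from raw class scores to probabilities can be sketched as follows; the three scores are invented so that the output loosely resembles the bird / dog / flower example (the figures quoted above are purely illustrative).

import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the maximum for numerical stability
    return exps / exps.sum()                 # probabilities that sum to 1

scores = np.array([2.0, 0.4, 0.1])           # assumed raw scores for (bird, dog, flower)
print(softmax(scores))                       # largest probability assigned to "bird"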
Each convolution layer is followed by an activation ReLU (Rectified Linear Unit)
layer that makes the operation non-linear: it sets the negative activation values to zero
and retains the positive activation values. The ReLU layers are followed by pooling
layers, the most popular being the max pooling layer, which takes the activation inputs
and produces a downsampled output. For example, if the layer is max pooling and the
filter size is 2 x 2, it takes the maximum value from each 2 x 2 activation region, and the
composition of these values from all the regions is the array output of this layer. Instead
of the maximum value, other operations such as averaging are also used. There are also
dropout layers that drop certain activation outputs during the training phase of the
network, which prevents overfitting.
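The ReLU and 2 x 2 max-pooling operations just described amount to the following small sketch; the sample activation map and the 2 x 2 filter size are assumptions made for illustration.

import numpy as np

def relu(x):
    return np.maximum(x, 0)                  # negative activations become zero, positives are kept

def max_pool_2x2(x):
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]            # trim odd edges for simplicity
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

act = np.array([[ 1.0, -2.0,  3.0,  0.5],
                [-1.0,  4.0, -0.5,  2.0],
                [ 0.2, -0.3,  1.5, -1.0],
                [ 2.5,  0.1, -2.0,  0.8]])
print(max_pool_2x2(relu(act)))               # one maximum per 2 x 2 region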
Several software packages are available to design and implement deep neural networks. Some of
them are open source, and an overview of the available software is outlined below.
Tensorflow [7, 8] is open source software that has been developed specifically
for machine learning. Its built-in functions ease the design of deep neural networks in
a modular manner. The software also adapts to the use of multiple Central Processing
Units (CPUs) / GPUs in execution. The Application Programmer Interface (API) is
built in layers that provide control to the user at different levels.
R [9, 10, 11] is a programming environment composed of software utilities that are
available at no cost. It is a GNU project that provides functions, libraries, debuggers and other
modules that make algorithm development easier. R follows a graphical
approach and provides a set of tools for data storage, manipulation, analysis and other
computations. It is an extension of the earlier S language, and most code written in S
runs under R.
MATLAB [12, 13]: Deep learning requires a substantial amount of data and
computation, and hence high computing power with parallel architectures. The
neural network and other allied toolboxes in MATLAB allow users to create, train and
use neural networks for various applications. The allied toolboxes allow the image
to be acquired, input to the neural network, and the processed output to be displayed.
Caffe [14, 15, 16, 17] is a deep learning framework that has been designed with
C++, Python and MATLAB. It is an open source framework from UC Berkeley that lets
researchers design and train deep neural networks. Caffe has an expressive architecture,
extensible code and high speed, which have created a community of users for industrial and
research applications.
Torch [18, 19] is open source software for machine learning based on LuaJIT
script with a C implementation. It has a library that can be used by developers for
applications in deep learning. The core object of the library is the tensor-based Torch
object. Complex neural networks can be built with ease and flexibility, with high
execution speed.
Microsoft Cognitive Toolkit [20, 21, 22] allows users to develop, train and apply
their own neural networks. It is an open source library for deep learning supported by
Windows and Linux. It runs in different environments – from CPUs to GPUs and it is
compatible with existing programming languages.
Keras [23, 24, 25] is a high level neural network programming API intended for
fast development of algorithms and applications. It can run on top of Theano,
Microsoft Cognitive Toolkit (CNTK) and Tensorflow, seamlessly on CPUs and GPUs.
It is a collection of modules placed in a library developed in Python.
Deeplearning 4j [26] is a programming library written for Java. The algorithms
include programs for various types of deep neural networks. The basic neural network
layers are stacked on top of each other to create deep networks. These networks have
artificial intelligence and can produce astounding results for complex problems that
were solvable only by humans until now.
MXNet [27, 28] is open-source software for training and deploying deep neural
networks. It is scalable and flexible, suitable for fast training, and supports a flexible
programming model and multiple languages
(C++, Python, Julia, Matlab, JavaScript, Go, R, Scala, Perl, Wolfram Language).
ConvNetJS [29] is a Javascript library for training Deep Neural Networks. It
allows the users to formulate and solve Neural Networks in Javascript and has been
extended since inception by contributions from the community.
Several CNNs have been implemented to classify thousands of pattern classes
[30]. The error rates achieved were lower than those of previously existing CNNs.
One typical network has around 650,000 neurons, and GPUs have been used in training.
Currently, deep neural networks are being trained on CPUs, GPUs and in the cloud.
CPUs are the cheapest option but they can be used only for pre-trained networks.
GPUs can be used for training the deep neural networks as well as for classification.
Cloud based resources can be utilized for both training and classification, without
procurement of expensive hardware.
One of the successful CNNs developed is AlexNet, which has been trained on more
than a million images. It was first published in 2012 and can classify various objects and
several breeds of animals, as shown in Figure 7. AlexNet can be used along with a
webcam for image acquisition and the associated packages in MATLAB. Through
transfer learning it can also be used to classify images for which it has not been trained.
ZF Net, developed in 2013 by Zeiler and Fergus, is a modified version of AlexNet. VGG
Net was created in 2014 by Karen Simonyan and Andrew Zisserman of the University
of Oxford, with an error rate lower than the previous CNNs. GoogLeNet and Microsoft
ResNet were developed in 2015 with dramatically reduced error rates.
3. Image Classification
The input to the algorithm will be the pixels of the image that are numeric in
nature. The output will be a value or set of values indicating the class. The algorithm
will be a mapping function that maps the pixel values to the appropriate class. It can be
a single output multi-valued function or it can be multi-output single valued function to
indicate the class. Classification can be supervised or unsupervised. In supervised
classification, the number of classes is known and a set of training data with
information on their class is given. It is like learning from a teacher. In unsupervised
classification, the number of classes is not known and the training data are not available.
The relationship (or mapping) between the data to be classified and the various classes
has to be learned. It is like learning without a teacher. If some information is available
regarding the mapping of data to classes, then supervised and unsupervised methods
can be combined and it becomes semi-supervised.
Features are the most important parameters associated with input data based on
which the data are classified. Defining certain attributes of the object as features plays a
crucial role in classification. Extraction of features from the objects in the image is
used for classification.
Example:
• Satellite image is the input for which classification has to be done.
• Problem is to identify or classify the pixels into land, desert, crop coverage,
water, mountains, minerals, etc.
• Land cover can be classified into different themes called thematic mapping.
• Supervised learning: information on the relationship between pixel values in
different bands and the classes has to be known based on training data.
• Unsupervised learning: information on the relationship between pixel values
in different bands and the classes has to be learnt.
• Features: attributes of the pixels that could possibly be numerical values in
different bands of a multispectral image.
Examples of categorization of satellite images are given in Figure 8 [31] and
Figure 9 [32]. A mathematical function (discriminant function) has to be defined for
each categorization to which the inputs are the features and the output will be a
numerical value that indicates the class to which the data belongs. So it has to be a multi-
valued function that maps the feature set to classes. One method of extracting features
is by doing a statistical analysis of the pixels to obtain parameters like mean,
correlation, covariance, etc. If the mapping information of the parameters with the
classes is known or expertise is available, then it is supervised; otherwise it is
unsupervised. Training data can be used in developing these relations or mapping.
Figure 10. Training Samples of Handwritten Digits [Cited from Ref. 33]
In Figure 11 the image of an animal is taken and it is classified into four different
categories with varying probabilities [34]. In Figure 12, images are given and the
classification has to be done under different illuminations, view points, etc. In
Figure 13, image set classification has to be done.
Figure 12. Classification from an image(s) under different conditions [Cited from Ref. 34]
Figure 13. Classification from an image set [Cited from Ref. 34]
Several classification algorithms are available and some of them are listed below:
• Minimum Distance Classifier
• Bayes Classifier
• Neural Networks
• Support Vector Machines (SVM)
• Principal Component Analysis (PCA)
• k-means algorithm
Minimum Distance Classifier
Classification is done based on the minimum distance between the unknown pattern
vector and the prototype pattern vector that represents the mean of a pattern class [35].
The distance measure is the Euclidean distance computed using the feature vectors.
Each pattern class has a set of features that have to be extracted to make pattern
matching or recognition easier and more accurate. The features are represented in
terms of numbers in order to make mathematical computations.
Let $x$ be the vector representing the features of a pattern. Let $x_i$ be the pattern
vector of the $i$th pattern in a pattern class. Then $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; \ldots\; x_{in}]$, where $n$ is the
dimension of the feature vector (the number of features used to represent the pattern).
Hence $x_{ij}$ stands for the $j$th feature of the $i$th pattern in the given pattern class. The mean
vector of a pattern class is computed to obtain the prototype vector of that class. It is
given by,

$$ m_r = \frac{1}{N_r} \sum_{i=1}^{N_r} x_i, \qquad r = 1, 2, \ldots, N \tag{1} $$

where $m_r$ is the mean vector of the $r$th pattern class, $N_r$ is the number of vectors in the
$r$th pattern class, and $N$ is the number of pattern classes. The Euclidean distance of a
given vector $x$ from the mean of a pattern class is defined as,

$$ D_r = \lVert x - m_r \rVert, \qquad r = 1, 2, \ldots, N \tag{2} $$

where $\lVert x \rVert = (x^T x)^{1/2}$ is the Euclidean norm of the vector $x$. The Euclidean distance
of the pattern vector $x$ to the mean of each class is computed, as shown in Figure 14
and Figure 15. The pattern represented by $x$ is assigned to class $r$ if $D_r$ is the smallest
distance.
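A compact rendering of equations (1) and (2) in Python, with small invented class prototypes, might be:

import numpy as np

# Assumed training vectors for two pattern classes (toy 2-D feature vectors).
class_vectors = {
    0: np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]]),
    1: np.array([[4.0, 4.2], [3.8, 4.0], [4.1, 3.9]]),
}

# Equation (1): prototype (mean) vector of each pattern class.
means = {r: vecs.mean(axis=0) for r, vecs in class_vectors.items()}

def classify(x):
    # Equation (2): Euclidean distance to each class mean; assign to the smallest.
    distances = {r: np.linalg.norm(x - m) for r, m in means.items()}
    return min(distances, key=distances.get)

print(classify(np.array([1.1, 0.9])))   # assigned to class 0
print(classify(np.array([3.9, 4.1])))   # assigned to class 1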
Figure 14. Distance of pattern vector x to the mean of various pattern classes – Example 1 [Cited from
Ref.36]
Figure 15. Distance of pattern vector x to the mean of various pattern classes – Example 2 [Cited from
Ref. 37]
Bayes Classifier
The Bayes classifier assigns an unknown pattern $x$ to a class on the basis of the a posteriori probability

$$ p(\omega_i \mid x) \tag{3} $$

When the pattern $x$ is wrongly classified as belonging to class $\omega_j$, the loss incurred is
$L_{ij}$. The average loss incurred when the input pattern $x$ is classified as belonging to
class $\omega_j$ is,

$$ r_j(x) = \sum_{k=1}^{N} L_{kj}\, p(\omega_k \mid x) \tag{4} $$

which, using Bayes' rule, can be written as

$$ r_j(x) = \frac{1}{p(x)} \sum_{k=1}^{N} L_{kj}\, p(x \mid \omega_k)\, p(\omega_k) \tag{5} $$

The loss functions $r_1(x), r_2(x), \ldots, r_N(x)$ are computed for the $N$ pattern classes and
the pattern $x$ is classified as belonging to the class $j$ that has the minimum value among
all the $r_j(x)$. The pattern is classified as class $\omega_j$ if,

$$ \sum_{k=1}^{N} L_{kj}\, p(x \mid \omega_k)\, p(\omega_k) \le \sum_{p=1}^{N} L_{pi}\, p(x \mid \omega_p)\, p(\omega_p) \tag{6} $$

In the particular case of a zero–one loss function, this condition reduces to assigning $x$ to class $\omega_i$ if,

$$ p(x \mid \omega_i)\, p(\omega_i) > p(x \mid \omega_j)\, p(\omega_j) \tag{7} $$

so the decision functions take the form

$$ d_j(x) = p(x \mid \omega_j)\, p(\omega_j), \qquad j = 1, 2, \ldots, N \tag{8} $$

and the pattern $x$ is assigned to the pattern class with the largest decision value. The
accuracy of the classification depends on the estimation of the probabilities
$p(x \mid \omega_j)$; for $n$-dimensional pattern vectors, a Gaussian distribution
is usually assumed.
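Under the Gaussian assumption just mentioned, the decision function of equation (8) can be sketched as below; the priors, means and covariances are invented for illustration, and scipy's multivariate_normal is assumed to be available to supply the class-conditional densities.

import numpy as np
from scipy.stats import multivariate_normal

# Assumed class priors p(w_j) and Gaussian parameters for two classes.
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]

def classify(x):
    # d_j(x) = p(x | w_j) p(w_j); assign x to the class with the largest decision value.
    d = [multivariate_normal.pdf(x, mean=m, cov=c) * p
         for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(d))

print(classify(np.array([0.2, -0.1])))   # assigned to class 0
print(classify(np.array([2.8, 3.1])))    # assigned to class 1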
Neural Networks
Neural networks [35] are composed of a network of neurons, which are small processing
elements similar to biological neurons. The neurons are arranged so that they form
layers. The perceptron is a simple neural network, shown in Figure 16. The input vector
$x = [x_1\; x_2\; x_3 \ldots x_n]$ is applied to the neurons forming the input layer. The
output of each input $x_k$ is weighted by a factor $w_k$ and the weighted inputs are summed.
The weighted sum is passed to an activation function that is activated whenever the sum
crosses a fixed threshold.
The response of this network is given by,

$$ d(x) = w_0 + \sum_{k=1}^{n} x_k w_k \tag{9} $$

where $w_0, w_1, w_2, \ldots, w_n$ are the weights and $d(x)$ is the linear decision function. The
output is $O = +1$ if $d(x) > 0$ and $O = -1$ if $d(x) < 0$. The final output is determined by an
activation function which implements the threshold. This simple perceptron can be
used to differentiate between two pattern classes. The vector $x$ corresponds to the
feature vector of the given pattern, and the output of the neural network can take one of
two levels, either $+1$ or $-1$, indicating the two different pattern classes. The weights can
be varied, which changes the decision function, and the activation function can be varied,
which determines the activation threshold of the output, as shown in Figure 16.
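Equation (9) and the ±1 thresholding can be expressed in a few lines; the weights below are assumed rather than obtained by training (a perceptron learning rule could be layered on top of this sketch).

import numpy as np

w0 = -1.0                            # assumed bias weight
w = np.array([0.8, 0.6])             # assumed input weights

def perceptron_output(x):
    d = w0 + np.dot(w, x)            # linear decision function d(x), equation (9)
    return 1 if d > 0 else -1        # hard-limiting activation: two pattern classes

print(perceptron_output(np.array([2.0, 1.0])))   # output +1
print(perceptron_output(np.array([0.2, 0.3])))   # output -1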
This basic concept of pattern classification with a simple perceptron can be
extended further to higher level neural networks with more number of input nodes,
output nodes and hidden layers between the input and output. More neurons in the
output can differentiate more number of pattern classes. Hidden layers can help in the
design of the network to be more accurate in the classification process. The weights
can be modified by applying a training algorithm with patterns belonging to known
pattern classes. The more training data available, the better the network will classify
unknown input patterns.
The patterns can be differentiated into separable and non-separable classes. In the
separable case, the pattern classes can be easily differentiated with a decision boundary
separating them. In the non-separable case, the pattern classes are not exactly
separable and there is some amount of overlap between them that makes exact
separability difficult. Multilayer feedforward neural networks with layers of neurons –
one input layer, one output layer and multiple hidden layers – have been designed and
trained to classify patterns that are not linearly separable. The activation function can
be hard limiting step function or soft limiting sigmoid function. Different activation
functions can be used for different layers or for different nodes in the layer. The
squared error between the desired output and the actual output of the network is to be
minimized. The training of the network is done by back propagation that adjusts the
weights of the network such that the error in the output is minimum.
Support Vector Machines (SVM)
Figure 17(a) – (c). Hyperplanes separating two classes [Cited from Ref. 39]
Figure 18. Location of optimal hyperplane Figure 19. Equation of hyperplane [Cited from Ref. 40]
[Cited from Ref. 40]
The optimum location of the hyperplane will be such that the distance from the
hyperplane to the input vectors of each class should be maximum, as shown in
Figure 18. The input vectors closest to the hyperplane are called support vectors. The
location of the support vectors determines the hyperplane and the classification
accuracy of the algorithm. The separation between the hyperplane and the support
vectors determines the margin. For good classification, the margin should be
maximized, as shown in Figure 18 [40].
The hyperplane can be represented by the equation wTx + b = 0, where x
represents all the training input vectors lying on the hyperplane, b is the bias,
representing the displacement from the origin, w is a vector that is perpendicular to the
hyperplane and m is the functional margin of the hyperplane, as shown in Figure 19
[40]. For good classification and separability, all training vectors belonging to class 1
should satisfy wTx + b ≤ -1 and all training vectors belonging to class 2 should satisfy
wTx + b ≥ +1.
In 1995, Corinna Cortes and Vladimir N. Vapnik proposed a soft-margin method
for the maximum-margin hyperplane for the case where it is not possible to cleanly
separate the two classes. If the input vectors are not linearly separable, the SVM allows
the finite-dimensional input vector space to be mapped to a higher-dimensional feature
space that makes separability easier. Kernel functions of the form K(x, y) are suitably
selected so that the dot products involved incur minimum computational load. In the
higher-dimensional space, the hyperplane is defined as the set of points whose dot
product with a vector in that space is constant [41, 42].
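As a hedged illustration (not code from the chapter), scikit-learn's SVC can be used to fit such a maximum-margin classifier; the toy training points, the RBF kernel choice and the value of C below are assumptions.

import numpy as np
from sklearn.svm import SVC

# Assumed toy training vectors: class 1 labelled -1, class 2 labelled +1.
X = np.array([[1.0, 1.0], [1.5, 0.5], [4.0, 4.5], [4.5, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='rbf', C=1.0).fit(X, y)     # C controls the soft-margin trade-off
print(clf.support_vectors_)                  # the input vectors closest to the hyperplane
print(clf.predict([[1.2, 0.8], [4.2, 4.3]]))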
Principal Component Analysis (PCA)
PCA extracts the important information from the data and represents it in terms of
principal components that are orthogonal to each other [43, 44]. The patterns or
relationships between the various data can be extracted and visualized. The statistical
components are obtained from the data using Singular Value Decomposition (SVD).
PCA is a statistical technique with a wide range of applications such as object
detection, face recognition, image compression, pattern recognition, classification, etc.
When there is a large set of data to be analyzed, PCA can be applied for computing the
statistical parameters. The mean is the statistical average of a set of data. The standard
deviation is a measure of the spread of the values of a given set around the mean. The
variance is also a measure of the spread of the data about the mean, and is the square of
the standard deviation. The covariance is a measure of the spread of data in two dimensions.
PCA can find patterns in data, and by analyzing these patterns, dimensionality can
be reduced. The covariance matrix of the given set of data is computed after
subtracting the mean in each dimension. The eigenvectors and eigenvalues of the
covariance matrix are calculated; the eigenvector with the highest eigenvalue is the
principal component of the data set. Ordering the components by eigenvalue and
choosing the first p components from the set of n-dimensional data reduces the
dimension. The feature vector is constructed by forming a matrix with the eigenvectors
as its columns. Finally, the input data are expressed in terms of these eigenvectors
instead of the original basis functions. The data are thus expressed in terms of the
patterns existing among them, making pattern classification possible.
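The steps just listed – mean subtraction, covariance, eigendecomposition, and projection onto the first p components – can be sketched directly; the random data matrix and the choice p = 2 are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))            # assumed: 100 samples with 5-dimensional features
p = 2                                       # number of principal components to keep

centered = data - data.mean(axis=0)         # subtract the mean in each dimension
cov = np.cov(centered, rowvar=False)        # covariance matrix of the data
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues / eigenvectors (ascending order)
order = np.argsort(eigvals)[::-1]           # reorder by decreasing eigenvalue
feature_matrix = eigvecs[:, order[:p]]      # eigenvectors as columns form the feature vector
reduced = centered @ feature_matrix         # data expressed in terms of the principal components
print(reduced.shape)                        # (100, 2)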
k-means algorithm
A set of data is given in an n-dimensional space Rn. The problem is to find k points
that can serve as the centers of k clusters such that the mean squared distance from any
data point in a cluster to its cluster center is minimized. The k cluster centers should
therefore be chosen so that the average squared distance from the data points to their
nearest center is minimized. Lloyd's algorithm is one of the popular clustering
algorithms available to solve such problems, and the k-means algorithm [45] is one of
the simple iterative techniques for finding the k cluster centers so that the squared-error
distortion is minimized.
According to Lloyd's algorithm, the optimum center of a cluster is its centroid.
Initially, a set of k points is identified, each represented by a point z; the region around
z is called its Voronoi region V(z). Each iteration of the algorithm recomputes the
centroid of each Voronoi region so that the mean squared error decreases. The algorithm
stops when the error falls below a pre-defined threshold or the maximum number of
iterations is reached.
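A Lloyd-style iteration can be sketched as below; the toy data, k = 2, the random initialization from the data points, and the fixed iteration cap are all assumptions (in practice a library routine such as scikit-learn's KMeans would normally be used).

import numpy as np

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
k = 2
centers = points[rng.choice(len(points), k, replace=False)]    # initial set of k points

for _ in range(100):                                           # maximum number of iterations
    # Assign each point to its nearest center (i.e. to that center's Voronoi region).
    labels = np.argmin(np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
    # Move each center to the centroid of its region, reducing the mean squared error.
    new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):                      # stop when the centers settle
        break
    centers = new_centers

print(centers)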
Some of the popular classification algorithms have been discussed above.
Dimensionality reduction is another problem associated with classification. When a
larger number of features is extracted, they have to be properly represented and
processed. For example, in a 7 x 5 image, the number of pixels to be processed for each
segmented region is 35. If there are 10 characters in the image, then 350 pixels have to
be processed in one component of the image. If it is a color image, there are
three components for each pixel (RGB) and the number of values to be processed
becomes 350 x 3 = 1050. When there are thousands of training samples in a database, this
can lead to increased time and computational complexity. This computational complexity
leads to the concept of feature reduction, which refers to extracting the minimal number
of features, among the existing ones, that are necessarily required for classification. The
representation of features also plays a major role in dimensionality reduction.
4. Applications
In the paper by Munawar Hayat et. al. [46] deep learning has been applied for image set
classification. The main contributions of this work are : (i) Deep Reconstruction
Models (DRMs) that discover the underlying geometry of the data. (ii) Three voting
strategies – majority, weighted, preferentially weighted – take the inputs from the
DRMs (reconstruction error information) and decide the class of image data set.
(iii) Using automatic pose group approximation, classification performance is further
refined. Face recognition from Kinect data is formulated as RGB-D based image
classification problem. The performance of the algorithm has been evaluated on
various data sets and it has been found to be superior.
In [47], a Convolutional Neural Network (CNN) has been developed for image super-resolution (SR). A high resolution (HR) image is obtained from a low resolution (LR) image using deep convolutional neural networks: the input to the network is the LR image and the output is an HR image. The deep-learning-based CNN and sparse-coding-based approaches have been compared, with deep learning achieving good quality and speed. Bicubic interpolation is applied to the low resolution image as a pre-processing step. From the interpolated LR image, overlapping patches are extracted; each patch is represented as a high dimensional vector comprising a set of feature maps, whose number equals the dimension of the vector. These vectors are mapped to another set of high dimensional vectors, forming another set of feature maps, and the vectors representing the patches are aggregated to reconstruct the HR image.
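A rough Keras sketch of an SRCNN-style network of this kind follows. The 64/32 filter counts and 9-1-5 kernel sizes follow the commonly cited SRCNN configuration and are assumptions here, not necessarily the exact settings of [47]; the use of tensorflow.keras is likewise an assumed tool choice.

```python
# Rough SRCNN-style sketch in Keras (filter counts and kernel sizes are illustrative assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_srcnn(channels=1):
    lr = layers.Input(shape=(None, None, channels))                   # bicubic-upscaled LR image
    x = layers.Conv2D(64, 9, padding="same", activation="relu")(lr)   # patch extraction / feature maps
    x = layers.Conv2D(32, 1, padding="same", activation="relu")(x)    # non-linear mapping between vectors
    hr = layers.Conv2D(channels, 5, padding="same")(x)                # aggregation / HR reconstruction
    return models.Model(lr, hr)

model = build_srcnn()
model.compile(optimizer="adam", loss="mse")
model.summary()
```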
In [48], the acceleration of test-time computation of deep CNNs has been proposed. It achieved a speed-up of 4x at the cost of a slight increase in ImageNet classification error. The focus is on the non-linearity, enabling an asymmetric reconstruction without applying the stochastic gradient descent algorithm. In [49], Deep Dynamic Neural Networks (DDNN) have been applied to segmentation and multimodal gesture recognition. A semi-supervised architecture with a Hidden Markov Model has been developed. Skeleton joint information, along with depth and RGB images, is used as input for classification and gesture recognition; the performance is on par with other learning based algorithms. Kaiming He et al. [50] have proposed spatial pyramid pooling (SPP-net). In conventional CNNs the input images have to be of fixed size; spatial pyramid pooling eliminates this requirement and generates a fixed length representation for any size of input image. The algorithm is robust to scaling and deformations of the input image. The feature maps are extracted from the image once and pooled for the sub-images. It has been found to improve detection and classification accuracy compared to existing methods.
In [51], object detection based on region based CNNs has been proposed. Supervised pre-training followed by domain specific fine tuning is applied. First, category independent region proposals are generated; these are the candidate detections. Second, a CNN extracts features from the regions. Third, a set of class-specific SVMs is applied. Accuracy improved by more than 50% relative to the best previous result on VOC2012. Deep CNNs can learn task-specific features, which improves performance in computer vision [52]. Since acquiring labeled data can be quite challenging, training CNNs using unlabeled data is proposed in that work; the classification results outperform existing unsupervised learning methods on several popular data sets. In [53], a hybrid CNN-Restricted Boltzmann Machine model has been proposed for face verification. The uniqueness of this work lies in learning high level relational visual features with rich identity similarity information: the input is processed through the multiple layers of the network to extract high level features, and the Restricted Boltzmann Machine performs the classification. Zifeng Wu et al. [54] have proposed deep CNNs for gait based human identification through similarity learning. Changing gait patterns can be used to identify persons, and the method has achieved good results.
5. Conclusion
This chapter has given an insight into deep learning neural networks, image classification methods and an overview of deep learning applications to image classification, computer vision and object identification. Deep neural networks are an integral part of the machine learning that underlies artificial intelligence. Such deep learning techniques are required for solving complex tasks which the human brain performs automatically with ease. In trying to imitate or uncover the intricacies of human intelligence, deep neural networks have paved the way for developing artificial intelligence applications that have proved difficult in the past. Further research in this field is required for performing complex tasks that the human brain accomplishes without visible effort.
References
[1] http://www.deeplearningbook.org/
[2] Remote Sensing Classification Algorithms Analysis Applied To Land Cover Change, Master in
Emergency Early Warningand Response Space Applications. Mario Gulich Institute, CONAE. Argentina,
2014.
[3] www.mathworks.com
[4] https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471
[5] S. Yu, S. Jia, C. Zu, Convolutional Neural Networks for Hyperspectral Image Classification,
Neurocomputing, 219 (2017), 88–98.
[6] https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding
Convolutional-Neural-Networks/
[7] https://www.tensorflow.org/tutorials/wide_and_deep
[8] https://www.tensorflow.org/tutorials/
[9] https://www.r-project.org/about.html
[10] http://www.r-tutor.com/deep-learning/introduction
[11] https://en.wikipedia.org/wiki/R_(programming_language)
[12] https://in.mathworks.com/help/nnet/ug/deep-learning-in-matlab.html
[13] https://in.mathworks.com/discovery/deep-learning.html
[14] http://caffe.berkeleyvision.org/
[15] http://caffe.berkeleyvision.org/tutorial/
[16] https://devblogs.nvidia.com/parallelforall/deep-learning-computer-vision-caffe-cudnn/
[17] http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/
[18] https://en.wikipedia.org/wiki/Torch_(machine_learning)
[19] http://torch.ch/
[20] https://www.microsoft.com/en-us/cognitive-toolkit/
[21] https://www.microsoft.com/en-us/cognitive-toolkit/blog/2017/06/microsofts-high-performance-open-
source-deep-learning-toolkit-now-generally-available/
[22] https://channel9.msdn.com/Shows/Microsoft-Research/Microsoft-Cognitive-Toolkit-CNTK-for-Deep-
Learning
[23] https://keras.io/
[24] https://www.datacamp.com/community/tutorials/deep-learning-python#gs.IipBCuo
[25] https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
[26] https://en.wikipedia.org/wiki/Deeplearning4j
[27] http://mxnet.io/
[28] https://en.wikipedia.org/wiki/MXNet
[29] http://cs.stanford.edu/people/karpathy/convnetjs/
[30] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification With Deep Convolutional Neural
Networks, Proceedings of the 25th International Conference on Neural Information Processing Systems,
1097 – 1105.
[31] http://desktop.arcgis.com/en/arcmap/latest/extensions/spatial-analyst/image-classification/what-is-
image-classification-.html
[32] D. Lin, X. Xu, F. Pu, Bayesian Information Criterion Based Feature Filtering for the Fusion of Multiple
Features in High-Spatial-Resolution Satellite Scene Classification, Journal of Sensors, 2015 (2015),
1 – 10.
[33] M. Nielson, Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/, 2017.
[34] http://cs231n.github.io/classification/
[35] R.C. Gonzalez, and R.E. Woods, Digital Image Processing, Prentice Hall of India Pvt. Ltd., 2007.
[36] L. G. Martins, Introduction to Pattern Recognition, UCP, Porto, Portugal, (2011), http://artes.ucp.pt.
[37] http://www.jars1974.net/pdf/12_Chapter11.pdf
[38] E. Rogers, Y. Li, Parallel Processing in a Control Systems Environment, Prentice Hall, 1993.
[39] M. Law, A Simple Introduction to Support Vector Machines, Department of Computer Science and
Engineering, Michigan State University,
https://www.cise.ufl.edu/class/cis4930sp11dtm/notes/intro_svm_new.pdf
[40] http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
[41] C. J C Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 2 (1998), 121–167.
[42] P.-H. Chen, C.-J, Lin and B. Schölkopf, A tutorial on ν-support vector machines. Applied Stochastic
Models in Business and Industry, 21 (2005), 111–136. doi:10.1002/asmb.537.
[43] H. Abdi and L. J. Williams, Principal Component Analysis, 2 (2010), Computational Statistics, John
Wiley and Sons, 433 – 459.
[44] L. I Smith, A Tutorial on Principal Components Analysis, (2002),
http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf
[45] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, An Efficient k-
Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis
And Machine Intelligence, 24 (2002), 881 – 892.
[46] M. Hayat, M. Bennamoun, S. An, Deep Reconstruction Models for Image Set Classification, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 37 (2015), 1 – 15.
[47] C. Dong, C. C. Loy, K. He, X. Tang, Image Super-Resolution Using Deep Convolutional Networks,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 38 (2016), 295 – 307.
[48] X. Zhang, J. Zou, K. He, J. Sun, Accelerating Very Deep Convolutional Networks for Classification
and Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 38 (2016 ), 1943 –
1955.
[49] D. Wu, L. Pigou, P. J. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. M. Odobez, Deep Dynamic
Neural Networks for Multimodal Gesture Segmentation and Recognition, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 38 (2016), 1583 – 1597.
[50] K. He, X. Zhang, S. Ren, and J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for
Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 37 (2015), 1904
– 1916.
[51] R. Girshick, J. Donahue, T. Darrell, J. Malik, Region-Based Convolutional Networks for Accurate
Object Detection and Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence,
38 (2016), 142 – 158.
[52] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, T. Brox, Discriminative Unsupervised
Feature Learning with Exemplar Convolutional Neural Networks, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 38 (2016), 1734 – 1747.
[53] Y. Sun, X. Wang, X. Tang, Hybrid Deep Learning for Face Verification, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 38 (2016), 1997 – 2009.
[54] Z. Wu , Y. Huang, L. Wang, X. Wang, T. Tan, Comprehensive Study on Cross-View Gait Based
Human Identification with Deep CNNs, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 39 (2017), 209 – 226.
50 Deep Learning for Image Processing Applications
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-50
Abstract. Hand gestures and deep learning strategies can be used to control a virtual robotic arm for real-time applications. The robotic arm is portable, can be carried to various places, can be easily programmed to do the work of a hand, and is controlled using deep learning techniques. Deep hand is a combination of virtual reality and deep learning techniques. It estimates the active spatio-temporal features and the corresponding pose parameters for various hand movements, in order to determine the unknown pose parameters of hand gestures using various deep learning algorithms. A novel framework for hand gesture estimation has been built using a deep convolutional neural network (CNN) and a deep belief network (DBN), and a comparison in terms of accuracy and recognition rate has been drawn. This helps in analyzing the movement of a hand and its fingers, which can be used to control a robotic arm with a high recognition rate and a low error rate.
1. Introduction
1 K. Martin Sagayam, Department of ECE, Karunya University, Coimbatore 641 114, India; E-mail: martinsagayam.k@gmail.com.
in the field of health care, gaming, mobile phones and almost every other field. The development of VR was realized with the deployment of two technologies. The first was the development of the Kinect sensor by Microsoft Corporation [2]. The Kinect sensor features an RGB camera, a multi-array microphone and a depth sensor capable of tracking body movements. With this technology both color and depth information are obtained, which results in an improved human computer interaction (HCI) experience. One limitation is that it is very challenging to achieve 100% accurate results, due to low resolution and occasional interference from other objects. The second technology that has led to VR development is Deep Neural Networks (DNN), also known as Deep Learning (DL). It mimics the functionality of the human brain, acts as a platform for AI, and contains neuron-like structures that predict the output in the way a human brain would.
The use of DL is becoming popular as major companies such as Facebook, Google, Microsoft, and Amazon are further developing and applying DL techniques. Using DL, we can predict various outcomes, which is useful in many applications. Many industries are adopting DL techniques and solutions as a means of prediction. In various applications, AI appears to "think for itself" and thereby simplifies the work of human beings; DL can be considered a starting point for AI.
1.1. Outcome
Based upon the research presented in this book chapter, we plan to overcome the difficulties faced by hand gesture recognition and improve the accuracy rate of hand gestures. The difficulties that we aim to address are as follows:
• Using a single camera to capture the picture of the hand gestures and then estimating the other parameters, to lower the total cost.
• Improving the overall detection of various hand gestures.
• Extracting the features of a hand in a more precise manner, under a fixed lighting and background condition.
The VR arm is a concept which, once achieved, can be used for various real-time applications [23]. The VR arm is a robotic arm which is portable and can be controlled from a distance. It can be controlled using DL techniques and algorithms. Normally, to obtain a 3-D image of a hand, a multi-camera setup or GPU acceleration is deployed. This is practically challenging and limits availability to the public, making it effectively impossible for the common man to use; the realization of such a multi-camera setup is very expensive [10]. In this chapter, we present a more robust technique for building a robotic arm operated via hand gestures and deep learning, by utilizing just a single camera and estimating the other parameters [22]. There are many approaches to implementing hand gesture recognition and estimation. To detect the image of a hand, three types of methods are utilized:
Fig. 1 presents the framework with four main stages, beginning with the image acquisition stage, which leads to the pre-processing and segmentation stages, with classification being the final stage. A feedback loop checks whether the performance metrics are satisfied and accordingly updates the virtual robotic arm.
The dataset consisting of various images of different hand gestures was obtained from the website of Imperial College London [34]. Five datasets, each corresponding to a different gesture, are considered in this book chapter. The five hand gestures deployed are: a) center-to-left, b) compress hand, c) V-shape center-to-left, d) V-shape center-to-right, and e) left-to-right. Each class corresponds to one hand gesture action and comprises 100 images. Each sequence was filmed in front of a stationary camera with roughly isolated gestures; thus the datasets exhibit substantially large intra-class variations in spatial and temporal alignment. Each hand image is of size 320 x 240 and is available in JPEG format. Fig. 2 shows typical sample sequences from the five different hand gesture datasets.
During image acquisition, images are captured with a camera with the help of various sensors. The sensors present in the camera process the image and compute various values according to the image acquisition method, and these values are then passed to the pre-processing unit.
2.3. Pre-processing
The hand gesture parameters are obtained from the image acquisition in a fixed
background. By using various edge detection techniques like Sobel, Canny, Prewitt,
etc., the appropriate edges of the image are detected and then these edge values are sent
to the further process such as feature extraction and segmentation.
2.4. Segmentation
The hand gesture parameters obtained may contain a large number of values, and processing them all takes a long time. Therefore, to reduce processing time and latency, feature extraction is performed: only the important parameters are extracted and processed. These parameters are then further segmented using occlusion techniques.
2.5. Classification
This work presents two deep learning classifier realizations: DBN and CNN. The parameters are normalized in terms of scale, translation and rotation in order to train the Deep Belief Network (DBN) from the hand posture images. The CNN offers more effective parallel computation over these parameters than the DBN [5]. After processing through the various layers of the neural network, the final output for an image is obtained. Using DL techniques yields improved results in terms of accuracy and recognition rate.
3. Image Acquisition
Images are acquired in two stages, capturing and pipelining, which are described in detail in Sections 3.1 and 3.2.
The gesture of a hand is captured with the help of a webcam or any camera connected to a computer or laptop. Most webcams work in generally the same manner. Every camera has an image sensor chip built into it; these image sensors capture moving images and convert them into streams of digits. The image sensor chip is considered the heart of a camera. Camera sources available today are mostly based on two types of sensors: Complementary Metal Oxide Semiconductor (CMOS) technology and the Charge Coupled Device (CCD) sensor. Traditionally, hand gestures have been detected and tracked using a Kinect sensor, accelerometers, etc. [1, 2], which capture raw stream data and skeletal data for the system. The main drawbacks of the Kinect sensor are its limited effectiveness, its susceptibility to being hacked, and privacy issues.
The signal acquired by the sensor is processed and sent to the next building block. A CMOS sensor yields a digital signal as parallel streams of pixel segments in either YCbCr or RGB format, along with vertical and horizontal synchronization and a pixel clock [7]. There are options for supplying an external clock and adjustment signals to control the transfer of the image from the sensor nodes. A CCD uses an analog front-end (AFE) chip, for example the AD9948, which handles the analog data, quantizes it into digital form and creates suitable timing to scan the CCD array [8, 9]. The AFE yields a digitized parallel output of 10-bit or 12-bit values per pixel. Nowadays, low voltage differential signaling (LVDS) has become an important alternative to the parallel bus approach: it is an inexpensive, low pin-count, high-speed serial interconnect with better noise immunity and lower power consumption than the standard approach. Processing of the input data does not stop at the sensor node; several calculations are carried out by the processing element before the shutter button is released. The focus and exposure subsystems work with the help of mechanical camera components to control the lens position based on scene attributes. The auto-exposure algorithm measures the illumination over scene regions and compensates for over- or under-exposed regions by controlling the shutter speed, the aperture size, or both; the underlying aim is to maintain the relative contrast between the various regions of the input data and to achieve a target average luminance. Auto-focus algorithms are divided into two classes: active techniques, which use infrared or ultrasonic emitters/receivers to measure the distance between the camera and the object being acquired, and passive techniques, which make the focusing decisions from the acquired image data itself. The media processor controls the various lens and shutter motors by means of PWM output signals in both of the above prototypes. It also adjusts the automatic gain control (AGC) sensor circuits. In the final stage, the desired data from the camera sensor is obtained.
4. Image Pre-processing
Lawrence Robert has introduced Robert’s edge detection strategy in the year 1965 [10].
It rushes to process and to compute the 2-D spatial slope on an input data by its
gradient operator. This strategy accentuates areas of high spatial recurrence, which
regularly edges within an image. The contribution of this kernel function towards the
input is an image of gray scale type with regards to the yield is most used for this
method [10]. The pixel level values in every point in the specific yield speak to the
assessed, to compute the extent of the spatial domain of gradient data at that instant
point.
Fig. 3. The Roberts edge detection method uses a pair of 2x2 convolution kernels.
The Sobel edge detection method was introduced in 1970 [32]. Image partitioning is the major goal: the Sobel technique perceives edges using the Sobel operator, which approximates the derivative. Edges are marked at the points where the gradient is highest. The operator represents the regions of high spatial frequency that correspond to edges by computing a 2-D spatial gradient over the image, and it is used to find the absolute gradient magnitude at every point of an input grayscale image.
The Canny edge detection strategy remains one of the predominant methods used in industrial applications. It was presented by John Canny in 1983, developed for his Master's thesis at MIT. Although it was introduced many years ago, it still outperforms the majority of the newer algorithms that have been produced. The main objective of the Canny approach is to remove noise from the input data before locating the edge points within an image. It is able to find the edges and the true value of the threshold without disturbing the features of the edges in the input data, and it is less susceptible to noise than the Roberts and Sobel operators. The steps of the Canny edge detection technique are listed below [12]:
• Convolve the image f(r,c) with a Gaussian function to obtain the smoothed image f^(r,c):
  f^(r,c) = f(r,c) * G(r,c,σ)
• Compute the edge strength by applying a gradient operator, from which the edge direction and magnitude are obtained.
• Apply non-maximal suppression to the gradient magnitude.
• Finally, apply thresholding to the non-maximally suppressed image; the edges are then recognized.
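An illustrative use of these steps through OpenCV is sketched below; the file name, blur kernel size, sigma, and thresholds are placeholder assumptions, and cv2.Canny internally performs the gradient, non-maximum suppression and hysteresis stages described above.

```python
# Illustrative Canny edge detection with OpenCV (placeholder path and parameters).
import cv2

img = cv2.imread("hand.jpg", cv2.IMREAD_GRAYSCALE)        # "hand.jpg" is a placeholder path
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.4)        # convolve with a Gaussian
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)  # gradient + suppression + hysteresis
cv2.imwrite("hand_edges.png", edges)
```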
The Prewitt edge detection method was introduced by Prewitt in 1970 [32]. The Prewitt operator estimates the magnitude and orientation of an edge. A typical gradient operator needs a rather tedious calculation to estimate the direction from the magnitudes in the x and y directions; compass edge detection, in contrast, extracts the direction directly from the kernel with the largest response. Although it is limited to 8 possible directions, experience shows that most direct direction calculations are not much more accurate. This gradient operator is computed over a 3x3 neighborhood of pixels for the 8 directions: all 8 convolution masks are evaluated, and the mask with the largest module is selected. The Prewitt operator is slightly easier to implement than the Sobel operator; however, it produces somewhat noisier results.
H1 = [[ 1,  1,  1], [ 0,  0,  0], [-1, -1, -1]]        H2 = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
Fig. 5. The Prewitt edge detection method uses 3x3 convolution kernels.
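A combined sketch of the gradient operators discussed above (Roberts, Sobel, Prewitt) applied via 2-D convolution is shown below; SciPy is an assumed tool here, and the random array stands in for a grayscale frame.

```python
# Sketch of the gradient-operator kernels discussed above, applied via 2-D convolution.
import numpy as np
from scipy.ndimage import convolve

roberts_x  = np.array([[1, 0], [0, -1]])
roberts_y  = np.array([[0, 1], [-1, 0]])
sobel_x    = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
sobel_y    = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])
prewitt_h1 = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]])   # responds to horizontal edges
prewitt_h2 = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])   # responds to vertical edges

def gradient_magnitude(image, kx, ky):
    gx = convolve(image.astype(float), kx)     # gradient along one direction
    gy = convolve(image.astype(float), ky)     # gradient along the other direction
    return np.hypot(gx, gy)                    # per-pixel gradient magnitude

img = np.random.rand(240, 320)                 # stand-in for a 320x240 grayscale frame
print(gradient_magnitude(img, prewitt_h1, prewitt_h2).shape)
```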
5. Feature Extraction
After pre-processing of the hand gesture image, the feature vectors associated with the hand gestures are calculated. Various feature extraction techniques are available, and the most suitable one depends on the type of features to be extracted from the hand gesture. The main objective of feature extraction is to reduce the number of variables used to describe the data under supervised learning [13, 14]. Feature extraction derives, from the initial set of input data, values that are more informative and non-redundant, allowing the data to be interpreted and understood, and reducing the dimensionality of the hand image without loss of information [16, 17].
At each time interval, processing the full amount of data would overload the system, so only features that are informative and relevant are used for processing. Generating too many variables is common when handling large amounts of data [15, 16], which means the system requires large amounts of memory and power and may fail to generalize. The quantities that describe the feature vector are known as feature descriptors [14].
6. Classification
Deep neural networks (DNNs) are derived from artificial neural networks and include more than two layers of hidden variables in their structure; this approach is called Deep Learning (DL) [3]. The main objective of DL here is to automatically select and extract features from the input hand gesture data and then classify the target set with respect to the defined training set [5, 6]. The deep learning techniques used are described in detail in the following subsections.
Artificial neural networks (ANNs) comprising several hidden layers, also known as Deep Neural Networks (DNNs), are complex to deal with; their aim is to extract features automatically from the input data and then sort them into their respective classes. In recent years, the use of DL has expanded in various research fields. Vinod Nair and Geoffrey E. Hinton [31] introduced a top-level model for DBNs, evaluated on 3-D object recognition. In this work, DBN based hand gesture recognition has been developed: the parameters are normalized in terms of scale, translation and rotation in order to train the Deep Belief Network (DBN) from the hand posture images. Training DBNs is difficult to parallelize across a system, so the robustness of the system is increased by computing in parallel using convolutional neural networks (CNNs). These can be used in various pattern recognition applications such as handwriting recognition, face recognition, action recognition, etc.
DBNs are generative models that are probabilistic in nature. They comprise a number of hidden layers above a lower layer of visible units that receives the input data. A DBN has undirected connections between the upper units and directed connections between the lower units, as shown in Fig. 6(a) [3]. Naïve DBNs are generative neural networks that stack RBMs (resembling generative autoencoders); a DBN can be implemented as a pile of restricted Boltzmann machines (RBMs), with every two adjacent layers forming an RBM. There are two steps involved in the training process of a DBN, known as pre-training and fine-tuning [18].
A Restricted Boltzmann Machine (RBM) consists of one hidden layer and one visible layer, forming a bipartite graph without hidden-hidden or visible-visible connections, as displayed in Fig. 6(b). Since there are no hidden-hidden or visible-visible connections, the conditional distribution over the hidden units h given the visible units v can be written as a logistic function:

p(h_j = 1 | v; θ) = σ( Σ_i w_ij v_i + a_j )     (1)
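The following is a minimal NumPy sketch of Eq. (1); the layer sizes (1024 visible units, 500 hidden units, matching the experiments later in the chapter) and the random initial weights are illustrative assumptions.

```python
# Minimal sketch of Eq. (1): probability that hidden unit j turns on given visible vector v.
import numpy as np

def hidden_activation(v, W, a):
    """p(h_j = 1 | v) = sigmoid(sum_i w_ij * v_i + a_j)."""
    return 1.0 / (1.0 + np.exp(-(v @ W + a)))

v = np.random.binomial(1, 0.5, size=1024)         # visible units (e.g. 32x32 binary pixels)
W = np.random.normal(0, 0.01, size=(1024, 500))   # visible-to-hidden weights
a = np.zeros(500)                                 # hidden biases
print(hidden_activation(v, W, a).shape)           # (500,)
```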
• Pre-training
The greedy layer-by-layer algorithm, also known as pre-training, trains a stack of RBMs: every two adjacent layers are treated as an RBM, and contrastive divergence is used to learn the RBM weights. The first RBM layer is trained on the training data; then the hidden activation probabilities P(h|v) are calculated (Eq. (2)). For higher levels of RBM, P(h|v) from the previous layer is used as the training data, which yields features derived from the output of the layer below. After pre-training is complete, the starting values for all the units of the network are obtained, and the number of hidden layers equals the number of RBMs. (A rough sketch of a single contrastive-divergence update appears after this list.)
• Fine-tuning
DBNs have the capacity to obtain high-level representations from high-dimensional data. To achieve hand gesture recognition, an output layer with randomly initialized weights is attached to the top of the deep belief network after pre-training all the parameters θ of the DBN. Since fine-tuning is supervised learning, the corresponding label must be provided for each hand image; back-propagation is then used to fine-tune all the weights with the labeled input data.
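Below is a rough, simplified one-step contrastive divergence (CD-1) update for a single RBM layer and a single training vector; it is a sketch of the pre-training step described above, not the authors' code, and the layer sizes and learning rate are assumptions.

```python
# Rough one-step contrastive divergence (CD-1) update for a single RBM layer (illustrative only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01):
    rng = np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + a)                    # p(h|v) for the data
    h0 = (rng.random(ph0.shape) < ph0) * 1.0     # sample hidden states
    v1 = sigmoid(h0 @ W.T + b)                   # reconstruction of the visible layer
    ph1 = sigmoid(v1 @ W + a)                    # p(h|v) for the reconstruction
    W += lr * (v0[:, None] * ph0 - v1[:, None] * ph1)   # positive minus negative statistics
    a += lr * (ph0 - ph1)                        # hidden bias update
    b += lr * (v0 - v1)                          # visible bias update
    return W, a, b

v0 = np.random.binomial(1, 0.5, size=1024).astype(float)
W, a, b = np.random.normal(0, 0.01, (1024, 500)), np.zeros(500), np.zeros(1024)
W, a, b = cd1_update(v0, W, a, b)
```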
The canonical functionality of a CNN can be summarized in four key areas [20]:
• The input layer holds the pixel values of the image.
• The convolutional layer determines the output of neurons connected to local regions of the input by computing the scalar product between the regions connected to the input volume and their respective weights. The rectified linear unit, commonly known as ReLU, applies an element-wise activation function to the output of the previous layer.
• The pooling layer performs downsampling along the spatial dimensions of the input, which reduces the number of parameters within that activation map.
• The fully connected layers behave like those of a standard artificial neural network, producing class scores from the activations to be used for classification. ReLU can be used between these layers to improve performance.
Therefore, through this process, CNNs transform the original input layer by layer, using convolution and downsampling, into class scores for classification. A minimal sketch of such a network is given below.
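The following Keras sketch shows a small CNN of this kind for 32x32 grayscale inputs and five gesture classes (the data described in the experiments below). The layer sizes and the use of tensorflow.keras are illustrative assumptions, not the architecture of Table 1.

```python
# Hedged sketch of a small CNN for 32x32 grayscale, five-class gesture data (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),                          # input layer: pixel values
    layers.Conv2D(16, 5, padding="same", activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D(2),                                   # pooling (spatial downsampling)
    layers.Conv2D(32, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                     # fully connected layer
    layers.Dense(5, activation="softmax"),                    # class scores for 5 gestures
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```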
A series of experiments has been carried out to examine various parts of the proposed system. The analyses were performed on hand gestures from five different classes; these hand gestures were trained and then tested. For the hand gesture training and testing experiments, we considered five sets of hand gestures: center-to-left, compress hand, V-shape center-to-left, V-shape center-to-right and left-to-right [4]. We recorded frames from video clips containing the hand gestures and collected 500 frames in total as images. All images were converted to grayscale and rescaled to a resolution of 32x32 pixels [3], and the intensity of each pixel was normalized to the range (0-1).
Fig. 9. (a) Training and testing error rate of the DBN; (b) error rate of the DBN with and without pre-training on the training set.
We opted to use a DBN with three hidden layers of 500, 1000 and 2000 hidden units respectively. The number of input units is 1024, since the resolution of each input image is 32x32 pixels. The final layer consists of five units, one for each of the five sets of hand gestures, which were recorded at 60 frames per second. During pre-training of the first layer, the intensities of all pixels are taken as the input units. After pre-training the DBN on the unlabeled hand gesture images, a classification layer of five output units is attached, taking the final layer of the DBN as its input, and trained with the labeled images. The back-propagation algorithm is used to fine-tune the whole network for 32 iterations. The training and testing error rates of the fine-tuned network are shown in Fig. 9(a). The experiments were also carried out without pre-training the DBN; the results, shown in Fig. 9(b), indicate that pre-training with unlabeled data has a clearly beneficial impact on the training of the DBN. Table 1 shows a detailed analysis of the DBN architecture with respect to layers, connections and weights.
Table 1. Comparison of internal layers, weights and connections.
Layer | Type of Layer | Maps | No. of Weights | Connections
Fig. 10. Error rate for training and testing of the CNN as a function of the number of iterations.
After the completion of training, the DBN and CNN are run on a testing set. HOG+SVM is also tested with our dataset for comparison. For each hand image, a 144-D HOG feature is extracted [21]: the 32x32 image is split into 16 cells of 8x8 pixels each, the gradient of each cell is split into 9 orientation bins, and, without overlapping, each image consists of 4 blocks. For training the multiclass SVM, LIBSVM is used with a linear kernel for recognition. Table 2 shows the accuracy rates of DBN, CNN and HOG+SVM for each hand gesture. Comparing the results, both the CNN and the DBN obtain high recognition rates. The mean accuracy rates over all hand gestures are: DBN 98.35%, CNN 94.74% and HOG+SVM 89.62%. The proposed system with DBN is superior to the CNN-based result, and both DBN and CNN perform much better than the baseline HOG+SVM method.
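A sketch of such a HOG+SVM baseline is shown below. scikit-image and scikit-learn are assumed tools (the chapter names LIBSVM, which backs scikit-learn's SVC), and the block layout used here (1x1-cell blocks) is one way to obtain a 144-D descriptor; it may differ in detail from the exact configuration used in the experiments.

```python
# Sketch of a HOG+SVM baseline: 144-D HOG per 32x32 image, linear-kernel SVM (illustrative).
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_144(image32x32):
    # 32x32 image -> 4x4 grid of 8x8-pixel cells; 1x1-cell blocks -> 16 * 9 = 144 features.
    return hog(image32x32, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(1, 1))

X_imgs = np.random.rand(100, 32, 32)         # stand-in for the training frames
y = np.random.randint(0, 5, size=100)        # five gesture classes
X = np.array([hog_144(im) for im in X_imgs])
print(X.shape)                               # (100, 144)

clf = SVC(kernel="linear").fit(X, y)         # libsvm-backed linear SVM
print(clf.predict(X[:3]))
```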
To implement robotic arm control, a 3-D model of a robotic arm is chosen [34, 35]. A variety of functions, corresponding to the hand gesture movements, are coded and saved in the database. Hand gestures are used to control the robotic arm: five different movements of the robotic arm, related to the five different gesture actions, are considered. If a captured gesture matches a gesture stored in the trained database, the corresponding instruction set is activated and passed on to the robotic arm model for execution. Fig. 11 shows the movement of the robotic hand corresponding to the five different hand gestures. At present, only five movements corresponding to five different actions have been implemented; in future, more actions will be considered and executed by the robotic arm model. Thus, using different hand gestures, the robotic arm is controlled, which can be used in real time for various applications. As the accuracy rate from the analysis is high, hand gestures will be recognized with a high recognition rate and a low error rate, making the approach useful in the real world for performing complex operations.
The various actions performed by the VR arm model are as follows: (a) the controlled arm bends down when the hand is closed, (b) the controlled arm returns to its default position when the hand is made flat, (c) the controlled arm moves downwards and backwards when the hand is moved from center to right, (d) the controlled arm moves towards the left when the hand in a V-shape is moved from center to left, and (e) the controlled arm moves towards the right when the hand in a V-shape is moved from center to right.
8. Conclusion
In this research, a real-time hand gesture recognition module has been developed to control a robotic arm. First, we considered hand gesture data from MNIST datasets of five different classes, with 60 frames in each class. Both the DBN and the CNN have greater accuracy than HOG+SVM. By using a 3-D depth camera instead of a normal camera, the performance can be further enhanced: both 3-D spatial data and depth data can be obtained, the prediction rate will be higher, and these parameters can be used to reconstruct the 3-D hand image. With these techniques, hand gesture recognition can be made efficient, and realizing such a system could be inexpensive. Due to the higher accuracy rate, hand gesture recognition becomes more effective for the system. Finally, through various experiments we have shown that applying DL techniques through neural networks increases the accuracy rate of hand gesture recognition to a great extent. The output obtained using DL techniques can be fed to a robotic arm controller, which can be employed in various real-time applications such as performing surgeries, driving a car from home, or performing other tasks remotely from one place, resulting in less time consumption. Hence, future research will work on implementing this VR arm through hand gesture recognition in VR applications.
References
[1] A. El Gamal, H. Eltoukhy, CMOS image sensors, IEEE Circuits and Devices Magazine 21(3) (2005), 6-
20.
[2] Robert McCartney, Jie Yuan, Hans-Peter Bischof, Gesture recognition with the Leap motion controller, published in RIT Scholar Works, Rochester Institute of Technology (2015).
[3] Di Wu, Lionel Pigou, Pieter-Jan Kindermans, Nam Do-Hoang Le, Ling Shao, Joni Dambre, Jean-Marc
Odobez, Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition, IEEE
Transactions on pattern analysis and machine intelligence 38(8) (2016).
[4] Natalia Neverova, Christian Wolf, Graham Taylor, Florian Nebout, ModDrop: Adaptive multi-modal
gesture recognition, arXiv:1501.00102v2 [cs.CV] (2015).
[5] Ao Tang, Ke Lu, Yufei Wang, Jie Huang, Houqiang Li, A real time hand posture recognition system
using deep neural networks, ACM Transaction Intelligence System Technology 9(4) Article 29 (2013).
[6] Pooyan Safari, Martin Kleinsteuber, Deep learning for sequential pattern recognition, Technische
Universitat Munchen (2013).
[7] Nick Waltham, Observing photons in space-CCD and CMOS sensors, Springer publisher New York
(2013), 423-442.
[8] M. Bigas, E. Cabyuja, J. Forest, J. Salvi, Review of CMOS image sensors, Microelectronics, Elsevier
publisher 37 (2006), 433-451.
[9] Courtney Peterson, How it works: The charged-coupled device (CCD), Published by George town
university in the journal of young investigators, Inc. (2009).
[10] Van kasteren T., Englebienne G., and Krose B. J., Human activity recognition from wireless sensor
network data: Benchmark and software, In activity recognition in pervasive intelligent environments
(2011), 165-186.
[11] Rafael C. Gonzalez, Richard E. Woods and Steven L. Eddins, Digital Image Processing using MATLAB,
Pearson education Ptd. Ltd., Singapore (2004).
[12] Muthukirishnan R., M. Radha, Edge detection techniques for image segmentation, International journal
of computer science and information technology 3(6) (2011).
[13] Swapnali B., Vijay K., Varsha H., Feature based object detection scheme, Technovision International
Conference (2014).
[14] Bay H., Tuytelaars T., VanGool L., SURF: speeded up robust features, ECCV 1 (2006), 404-417.
[15] Bhosale Swapnali B, Kayastha Vijay S, Harpale Varsha K., Feature extraction using SURF algorithm
for object recognition, International journal of technical research and applications 2(4) (2014), 197-199.
[16] Bassem Sheta, Mohammad Elhabiby and Naser El-Sheimy, Assessment of different speeded up robust
features (SURF) algorithm resolution for pose estimation of UAV, Survey (IJCSES) 3(5), 2012.
[17] Kosuke Mizuno, Yosuke Terachi, Kenta Takagi, Shintaro Izumi, Hiroshi Kawaguchi and Masahiko
Yoshimoto, Architectural study of HOG feature extraction processor for real-time object detection,
IEEE workshop on signal processing systems (2012).
[18] G. E. Hinton, R. R. Salakhutidinov, Reducing the dimensionality of data with neural networks, Science
313(5786) (2006), 504-507.
[19] Keiron O'Shea, Ryan Nash, An introduction to convolutional neural networks, arXiv:1511.08458v2 (2015).
[20] Yann LeCunn et al, Generalization and network design strategies connectionism in perspective (1989),
143-155.
[21] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin and others, A practical guide to support vector
classification, (2003).
[22] Ganesh Choudary B., Chetan Ram B V., Real time robotic arm control using hand gestures, ICHPCAA
International Conference, IEEE Publisher (2014).
[23] Ayan Sinha, Chiho Choi, Karthik Ramani, Deep hand: robust hand pose estimation by completing a
matrix imputed with deep features, IEEE conference on computer vision and pattern recognition, (2016).
[24] Marco Fagiani, Emanuel Principi, Stefano Squartini, Francesco Piazza, A new system for automatic
recognition of Italian sign language – neural nets and surrounding, Springer publisher, (2013), 69-79.
[25] Micheal Van den Bergh, Luc Van Gool, Combining RGB and ToF cameras for real time 3D hand
gesture interaction, Applications of Computer Visions, IEEE publisher, (2011), 66-72.
[26] Lu Xia, Chia-Chih Chen, J K Agarwal, View invariant human action recognition using histograms of
3D joints, Computer vision and pattern recognition workshop, IEEE publisher, (2012).
[27] Pooanm Suryanaryanan, Anbumani Subramanian, Dinesh Mandalapu, Dynamic hand pose recognition
using depth data, 20th International Conference on Pattern Recognition, IEEE publisher, (2010).
[28] S Malassiotis and M G Strinzis, Real time hand posture using range data, Image and Vision
Computing 27(7) (2008), 1027-1037.
[29] Chuging Cao and Ruifeng Li, Real time hand posture recognition using Haar like and topological
feature, Machine vision and human machine interface, IEEE publisher, (2010), 683-687.
[30] Balaz Tusor and A R Varkonyi-Koczy, Circular fuzzy neural network based hand gesture and posture
modeling, Instrumentation and Measurement Technology Conference, IEEE publisher, (2010), 815-820.
[31] Vinod Nair and Geoffrey E. Hinton, Rectified linear units improve restricted Boltzmann machines,
Published by department of computer science, University of Toronto (2009).
[32] Rafael C. Gonzalez, Richard E. Woods and Steven L. Eddins, Digital image processing using MATLAB,
Published in Pearson Education Ltd., Singapore (2004).
[33] Yann LeCun et al, Generalization and network design strategies - Connectionism in perspective,
(1989), 143-155.
[34] T-K. Kim and R. Cipolla, Canonical correlation analysis of video volume tensors for action
categorization and detection, IEEE Transaction on Pattern Analysis and Machine Intelligence 31(8)
(2009), 1415-1428.
[35] Jagdish Raheja, Radhey Shyam, Real-time robotic hand control using hand gestures, published in
research gate, publication no: 221923729 (2012).
68 Deep Learning for Image Processing Applications
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-68
1. Introduction
With the widespread use of digitized images on the World Wide Web, huge numbers of users now work with digital data. This data is often in the form of digital images, since images are one of the most effective means of sharing, understanding and retaining information. Image retrieval can be divided into two types: exact image retrieval and relevant image retrieval. Exact image retrieval [1], which can be regarded as image recognition, has been implemented in real applications [2] by different researchers; it requires images to be matched exactly, or 100 percent, whereas relevant image retrieval is based on content, with a flexible scale of relevance depending on particular feature values. A greater number of users of digital data implies a greater amount of digital image handling and sharing, and consequently greater complexity in managing and controlling digital content; a digital content management system is therefore frequently required to provide an elegant interface for efficiently handling digital images in specific applications. The essential objective of an image management system is to search for images, and to keep pace with present-day applications, image searching should be based on visual content. For this reason, many researchers have devised techniques based on various parameters to obtain precise results with high retrieval performance. The origins of research in the field of content based image retrieval lie in the late 1970s: database technologies for pictorial applications were discussed for the first time in that period, and researchers have been attracted to this domain ever since.
1
Rajeev Kumar Singh, Suchitra Agrawal, CSE-IT Department, MITS, Gwalior (M.P.), India; E-mail:
suchiagrawal0007@gmail.com.
Earlier image retrieval procedures were not particularly intelligent or sophisticated: they were not able to search for images based on visual components; instead, they relied on text based metadata. All images stored in the database were first tagged with metadata, and images were then searched on the basis of this metadata. Such text based image retrieval techniques were used with traditional database applications and served many commercial purposes, but the increasing use and volume of digital images created performance and accuracy problems for them.
Color based retrieval: The color histogram allows the system to describe the complete color distribution of an image and is the most widely used technique for CBIR [3] because of its efficiency and effectiveness. The color histogram approach has the benefits of speed and low memory requirements, and because histograms are not affected by changes in the size and resolution of the image, it has received broad attention.
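A minimal sketch of color-histogram retrieval follows; per-channel histograms compared by histogram intersection are one common CBIR choice, used here as an illustrative assumption rather than the exact measure of [3], and the random arrays stand in for real images.

```python
# Minimal color-histogram retrieval sketch (histogram intersection as the similarity measure).
import numpy as np

def color_histogram(img, bins=8):
    # img: HxWx3 RGB array with values in [0, 255]; concatenate per-channel histograms.
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()                          # normalize so identical images score 1.0

def intersection(h1, h2):
    return np.minimum(h1, h2).sum()             # bin-wise minima summed

query = np.random.randint(0, 256, (240, 320, 3))
database = [np.random.randint(0, 256, (240, 320, 3)) for _ in range(4)]
scores = [intersection(color_histogram(query), color_histogram(d)) for d in database]
print(np.argsort(scores)[::-1])                 # database indices ranked by similarity
```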
Texture based retrieval: The identification of particular textures in an image is accomplished essentially by treating texture as a two-dimensional gray-level variation. Textures are characterized by differences in brightness at high frequencies in the image spectrum. They are helpful in distinguishing regions of an image with similar color (for example, sky and sea, or grass and water). A variety of techniques have been used for measuring texture similarity; the best established rely on comparing values of what are known as second-order statistics computed from the query and database images. Essentially, these estimate the relative brightness of selected pairs of pixels from each image. From these it is possible to measure image texture properties such as contrast, coarseness, directionality and regularity [4], or periodicity, directionality and randomness [5].
Shape based retrieval: Shape information can be extracted using the histogram of edge detection. Techniques for shape feature extraction include elementary descriptors, Fourier descriptors, template matching, quantized descriptors, Canny edge detection [6] and so on. Shape features are less developed than their color and texture counterparts because of the inherent difficulty of representing shapes. In particular, the regions covered by an object must be identified in order to describe its shape, and various known segmentation algorithms are applied to the low-level color and texture components along with region growing or split-and-merge. In general, however, it is rarely possible to segment an image precisely into meaningful regions using low-level features, because of the variety of possible projections of a 3-D object into 2-D shapes, the complexity of each individual object shape, and the presence of shadows, occlusions, non-uniform illumination, changing surface reflectivity, etc. [7].
The remainder of this chapter is organized as follows. Section 2 covers techniques for retrieving images on the basis of the content present in the query image. Section 3 gives a brief review of sketch based image retrieval. Section 4 concludes and provides future scope for image retrieval techniques.
The very first system developed for content based image retrieval was QBIC [8] by IBM, one of the well-known commercial systems. Since then, many techniques using a variety of methodologies have been proposed for CBIR; a few of them are discussed in the following sections.
An agent based searching framework is deployed in [9]. The scheme proposes the use of multiple agents to reduce the search space by utilizing an interleaving system. A neural network plays a vital part in creating the feature vector to be used by the agents for image retrieval in an interleaved fashion. The system treats the problem as a distributed query, and with the reduction of the search space, images are retrieved from the database simultaneously. It also uses the concept behind Lilies to deal with multiple agents.
An online image retrieval framework that allows multiple queries has been presented in [10]. The proposed strategy has been implemented over the web, and experimental outcomes are combined for a wide range of queries and tests. Neuro-fuzzy logic is used to interpret the query expressions provided by the user; the input query consists of the colors to be retrieved from the database of images.
In contrast to a linear relationship between the different features of an image, a method of dynamically updating the similarities among the components of an image has been proposed using a radial basis function neural network [11], which permits the combination of heterogeneous image elements for more relevant image retrieval. The intersection of the histograms of the query image feature vector and a database image feature vector is calculated as a normalized sum of bin-wise minima (Eq. (1)), where the summation runs over the number of bins in the histogram. The user's feedback on a retrieved image is then mapped to a relevance score of 0.9, 0.5 or 0, according to the degree of relevance indicated.
In the light of the implemented results, the framework gives better results than other comparable strategies.
The researchers in [12] introduced a scheme for altering the feature vector comparison strategy at run time based on the user's preference. The error function is defined as the sum, over all training patterns k, of the discrepancy between the rank provided by the user and the rank obtained by the system (Eq. (2)). Thus, the framework behaves like a human and learns the similarity matching components over various queries. The algorithm was tested on the NIST Mugshot Identification Database and the results obtained are quite acceptable.
To achieve higher efficiency in the retrieval results, the search space can be increased for specific features, providing a clearer basis for extracting images from the database. This can be implemented along with relevance feedback [13], where the system learns from the user feedback mechanism. The implemented system performed experiments on a database of more than 50,000 images, and the results showed that increasing the feature search space can retrieve more relevant results.
In [14], images of organs of interest are retrieved from the database. Images of healthy organs are placed in the database, and when an examination is performed on a medical image comprising various organs, the proposed system allows the user to identify organs according to the ones stored in the database. Neural networks are used to classify the query image, and images are then retrieved from the database by calculating the distance between their features.
A relevance feedback technique [15] that updates the query according to the search criteria has been implemented to improve performance. Each component of the query feature vector is updated by moving it towards the mean of the feature vectors of the set of relevant images and away from the mean of the feature vectors of the set of irrelevant images, scaled by suitable constants (Eq. (3)), where the two sets and their sizes are determined from the user's feedback. The similarity between the query image and a database image is then calculated as a weighted sum of normalized per-feature similarity measures between the query image and the retrieved image (Eq. (4)), where the inter-weight and intra-weight vectors are updated using Eqs. (5) and (6) respectively: the intra-weights are derived from the variance distances of the feature vectors, normalized between their minimum and maximum values, and the inter-weights from the standard deviation. The simultaneous processing of moving query vectors and updating weight factors accelerates the convergence of the relevance feedback retrieval system, which retrieves results more effectively than conventional image retrieval methods.
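A generic Rocchio-style sketch of such a query update is shown below; the constants and the exact weighting differ from Eqs. (3)-(6) of [15] and are assumptions made for illustration only.

```python
# Generic Rocchio-style relevance-feedback query update (illustrative, not the method of [15]).
import numpy as np

def update_query(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q_new = alpha * q
    if len(relevant):
        q_new += beta * np.mean(relevant, axis=0)     # move towards relevant images
    if len(irrelevant):
        q_new -= gamma * np.mean(irrelevant, axis=0)  # move away from irrelevant images
    return q_new

q = np.random.rand(64)                    # current query feature vector (stand-in)
R = np.random.rand(5, 64)                 # feature vectors marked relevant by the user
N = np.random.rand(3, 64)                 # feature vectors marked irrelevant
print(update_query(q, R, N).shape)        # (64,)
```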
The multi-label graph-cut algorithm [16] groups edges according to two key Gestalt principles, continuity and proximity. The perceptual edge grouping framework produces better edge groups, which can be used to improve a variety of higher level tasks. During sketch generation, edge grouping is carried out via edge detection followed by filtering to generate the sketch. HOG features are extracted for each machine generated sketch and for the query sketch. An edge map is constructed, and the perceptual grouping framework is then applied to group salient edges together.
The dynamic incorporation [17] of the user's interests into the search criteria, along with the use of intelligent agents to blend the user's interest matrix into subsequent retrievals, makes the procedure work intelligently. The prototype system, Locating Images Easily (LIZY), developed to test the approach, supports both query-by-sketch and query-by-example. The efficiency of the system was evaluated using a combined measure computed from the precision, recall and usability of the system. In the light of the experimental results, the framework performs well in comparison with other strategies of a similar kind.
The characteristics of self-organizing neural networks [18] are utilized to enhance image retrieval strategies. Self-organizing neural networks support the construction of a hierarchical quadtree map. The overall structure of the proposed framework is as follows:
To compare similarity between two images A and B, there is need to calculate distance
between the two defined as:
$D(A, B) = w_1 \sum_{i} d\!\left(f^{A}_{1,i}, f^{B}_{1,i}\right) + w_2 \sum_{i} d\!\left(f^{A}_{2,i}, f^{B}_{2,i}\right) + w_3 \sum_{i} d\!\left(f^{A}_{3,i}, f^{B}_{3,i}\right)$   (7)
where $w_1$, $w_2$ and $w_3$ are the weights of the color and texture features. The new query matrix is then formed using:
$q^{new} = \alpha\, q + \frac{\beta}{|R|}\sum_{i} r'_i - \frac{\gamma}{|N|}\sum_{i} n'_i$   (8)
where $\alpha$, $\beta$ and $\gamma$ are parameters controlling the relative weights of the current query image, the relevant images and the irrelevant images respectively, and $r'$ and $n'$ are the regions nearest to the regions of the query image. The time complexity of the growing hierarchical self-organizing quad-tree map is lower than that of a flat-structure comparison of images.
For color-based image retrieval, multivariate measures [19] are provided using neural networks. The image database adopted for the experiments is from the Corel gallery, as it contains a wide variety of images. The mean $E[W]$ and variance $\operatorname{Var}[W \mid C]$ for two multidimensional point samples $X$ and $Y$ of sizes $m$ and $n$ are calculated as:
$E[W] = \frac{2mn}{N} + 1$   (9)
$\operatorname{Var}[W \mid C] = \frac{2mn}{N(N-1)}\left[\frac{2mn - N}{N} + \frac{C - N + 2}{(N-2)(N-3)}\bigl(N(N-1) - 4mn + 2\bigr)\right]$   (10)
where $N = m + n$. After calculating the above parameters, the WW test for comparing color distributions is carried out by computing:
$W = \frac{R - E[W]}{\sqrt{\operatorname{Var}[W \mid C]}}$   (11)
Using the Neural-Gas network, the RGB feature vectors of images are extracted and
then similarity comparison is carried out between the query image and the database
images.
Multi-instance strategies [20] for learning user preferences over image taxonomies have been studied. For a given set of images, the energy function of the class is defined as:
$E(X, \Theta) = -\ln \sum_{i} \exp\!\Big(\sum_{k} \ln P\!\left(x_{ik} \mid \Theta\right)\Big)$   (12)
The proposed framework is trained to classify the images in a database as positive or negative with respect to the class of interest. After the positive and negative sets have been formed, the learning rules are applied to the corresponding subsets:
Reinforced Learning:
$w(t+1) = w(t) - \eta\, g(t)$   (13)
Anti-reinforced Learning:
$w(t+1) = w(t) + \eta\, g(t)$   (14)
where $\eta$ is a user-defined learning rate and $g$ is the gradient vector.
Tests were performed by training the framework with diverse groups of input images. After a few iterations of learning, the framework could effectively sort the pictures belonging to the class the user is interested in.
Another approach to image retrieval uses suitable weighting [21] when computing the similarity among images via relevance feedback; incorporating the images' texture elements into the retrieval procedure improves the precision. The distance from the object set to the query center is formulated as:
$d'_{j} = \frac{\sum_{i} h_i \cdot d_{ij}}{\sum_{i} h_i}$   (15)
where $d_{ij}$ is the distance between objects $i$ and $j$, $h_i$ applies to relevant objects, $d^{q}_{i}$ is the original distance between relevant object $i$ and the original query center, and $d'$ is the combined distance for each object. Experimental results demonstrated that the level of user satisfaction is improved.
The use of a semantic tree [22] for the image retrieval process, with a different way of utilizing the given query image, has been proposed. The semantic tree has to be updated after each successful retrieval, using the following algorithm:
1) If $T = null$, initialize $T$; else go to Step 2.
2) For each $i$, if $\lVert f_q - f_i \rVert < \varepsilon$, collect all the child nodes of $C_i$ into the set $S$ and all its leaf nodes into the set $Leaf$.
3) If $r \in S$, take all its child nodes as relevant images for $q$, then move to Step 8; else go to Step 4.
4) Calculate the similarity $sim(q, Leaf_i)$, where $Leaf_i$ represents an element of the set $Leaf$, and then display the nearest $K$ images to the user.
5) Add the indexes of the relevant images to $r$ and add $r$ to $C_i$, the cluster nearest to $q$.
6) Update $f$ in $r$ and the center of $C_i$, and go to Step 8.
7) For each $i$, if $\lVert f_q - f_i \rVert < \varepsilon$ does not hold, take the relevant images from the whole set and add $r$ to a newly constructed cluster node $C$ as a child.
8) Continue retrieving images until the user's requirement is met.
The above algorithm clusters the tree in such a way that it forms different semantic
categories for results. The results of the experiments have shown that the performance
of the system increases gradually.
An algorithm based on relevance feedback and feature selection [23], following a probabilistic scheme for image retrieval, has been proposed. The algorithm needs to remember the weight assigned to an image, which is done by combining the advantages of probabilistic modelling and user interaction. Further, on the basis of a similarity matching algorithm, the images are classified as positive and negative cases. In the case of only positive examples, the query likelihood is maximized to select the optimal parameters as:
$\Phi^{*} = \arg\max_{\Phi} \sum_{i=1}^{L} \log \prod_{n=1}^{N_i} \sum_{k=1}^{K_i} \pi_{ik}\, p\!\left(x_{in} \mid \Theta_{ik}\right)$   (16)
where $i = 1, \ldots, L$ are the feature indexes, $n = 1, \ldots, N_i$ indexes the data of the $i$-th feature, $k = 1, \ldots, K_i$ indexes the classes of the $i$-th feature, $x_{in}$ is the $n$-th data item of the $i$-th feature, and $\Theta_{ik}$ represents the set of parameters. In the case of both positive and negative examples, the likelihood is maximized as:
$\Phi^{*} = \arg\max_{\Phi} \sum_{i=1}^{L} \log \prod_{n=1}^{N_i} \mathcal{P}_{in}\, \mathcal{Q}_{in}$   (17)
where $\mathcal{P}$ and $\mathcal{Q}$ are the likelihood terms contributed by the positive and negative examples, respectively.
A scheme for reducing the gap between image semantics and low-level features using cubic-spline models [33, 34] based on neural network principles has been introduced. The similarity metric used to compare the feature vectors of the query image and a database image is defined as:
$\mathrm{Sim}\!\left(\vec{u}, \vec{v}\right) = \cos\theta = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert\, \lVert \vec{v} \rVert}$   (21)
where $\vec{u}$ and $\vec{v}$ are the feature vectors of the database image and the query image respectively. When contrasted with other image retrieval strategies, the results obtained show higher accuracy and better effectiveness.
An adaptive image retrieval framework [35] utilizing relevance feedback has been proposed. With user input, the repeatedly applied representation sub-schemes can learn and update their comparison criteria with the help of relevance feedback. The generalized similarity measure is defined as:
$S\!\left(q, x_l\right) = \Phi(q)^{\top}\, W\, \Phi\!\left(x_l\right)$   (22)
where $l \in [1, N]$, $\Phi(\cdot)$ is the set of scalar functions of the query image $q$ and of the database image $x_l$, and $W$ is a weight matrix. Compared to the standard generalized Euclidean distance measure, this offers a more general similarity function.
The image retrieval error can be reduced to zero percent with this adaptive framework. The tests were performed on a database of underwater images, and the results demonstrated that the framework is capable of learning from the user's queries and effectively updates its criteria for subsequent retrievals.
Learning the user's interests from log information [36], and thereby decreasing the relevance feedback time, has been analyzed. Log information can be effectively used to understand the user's intentions and can be utilized in the subsequent retrieval process; the framework learns from log information rather than from direct user input. The regularizer was selected based on:
$\Omega(w) = \lVert w \rVert = \sqrt{\sum_{i,j} w_{ij}^{2}}$   (23)
Using Eq. (23), a new Laplacian regularizer has been formulated as:
$\Omega(f) = \sum_{i,j} W_{ij}\,\bigl(f_i - f_j\bigr)^{2}$   (24)
where $f$ is the unlabeled data information with weight matrix $W$. To calculate the distance metric, Laplacian regularized metric learning is proposed as:
$\min_{A}\ \Omega(A) + \gamma_1 \sum_{(i,j) \in S} \lVert x_i - x_j \rVert_{A}^{2} - \gamma_2 \sum_{(i,j) \in D} \lVert x_i - x_j \rVert_{A}^{2}$   (25)
where $S$ and $D$ denote the sets of similar and dissimilar pairs. Results are good for small, trained data sets but are not as effective for large, untrained data sets.
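The Laplacian regularizer of Eq. (24) is easy to make concrete. The sketch below is an assumed NumPy rendering (not code from [36]): it evaluates the graph-smoothness penalty both as a double sum and as a quadratic form with the unnormalized graph Laplacian, which gives the same value.

import numpy as np

def laplacian_regularizer(f, W):
    """Graph-smoothness penalty sum_ij W_ij * (f_i - f_j)^2 (cf. Eq. (24))."""
    direct = np.sum(W * (f[:, None] - f[None, :]) ** 2)
    L = np.diag(W.sum(axis=1)) - W      # unnormalized graph Laplacian
    quadratic = 2.0 * f @ L @ f         # equals the direct double sum
    return direct, quadratic

W = np.array([[0., 1., 0.],
              [1., 0., 2.],
              [0., 2., 0.]])
f = np.array([0.1, 0.4, 0.9])
print(laplacian_regularizer(f, W))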
An approach for performing content-based queries over collections of 3D model databases [37] has been presented. The scheme uses a 3-level indexing model based on neural networks for successful retrieval. The total number of spin images at the three levels is formulated as:
$N_1 = \sum_{j} s_{j}$ at the first level, $N_2 = M$ at the second level, and $N_3 = k$ at the third level, where $s_j$ is the number of spin images of the $j$-th object, $M$ is the number of nodes in the self-organizing map and $k$ is the number of clusters produced by the k-means algorithm.
The outcomes demonstrated that the framework performs well on atomic data and can be used as part of any 3-D information retrieval application with efficient retrieval.
Inspired by text retrieval techniques, an image retrieval technique exploiting relevance feedback was implemented [38]. It combines two techniques: query point movement and query expansion. The images were then classified into clusters using a k-nearest-neighbors classifier, for which $k$ was selected as:
$k = \lvert C_i \rvert, \qquad i = 1{:}M$   (26)
where $C_i$ denotes the $i$-th cluster while $M$ is a fixed maximum value selected at random. The query point is then formulated as:
$q_{new} = \frac{\sum_{j=1}^{n_R} R_j}{n_R} - \frac{\sum_{j=1}^{n_N} N_j}{n_N}$   (27)
where $R$ and $N$ denote the relevant and irrelevant examples respectively.
Experiments showed that the combination of the above two techniques produced better results than either technique used alone.
Automatic weighting of images for a relevance feedback-based image retrieval system [39] requires learning the weight specification. The weight update for reinforcement learning is carried out as:
$r(t) = 1:\quad w_j(t+1) = \begin{cases} (1-\alpha)\, w_j(t), & \forall j \neq c \\ w_j(t) + \alpha\bigl(1 - w_j(t)\bigr), & j = c \end{cases}$
$r(t) = 0:\quad w_j(t+1) = \begin{cases} \beta + (1-\beta)\, w_j(t), & \forall j \neq c \\ (1-\beta)\, w_j(t), & j = c \end{cases}$   (28)
where $\alpha$ is the gain parameter for a positive reward and $\beta$ is the gain parameter when the reward is 0.
Along with the standard precision and recall measures, the performance of the system was evaluated using an error rate.
The approach is most useful where little user interaction is available, since the weights can be approximated by the reinforcement learning of the framework; users do not need to specify the weights explicitly.
Another two-stage methodology [40] has been proposed in which the first step extracts features from low-level components (color, shape and texture), while a Support Vector Machine (SVM) classifier is used in the final step to handle noisy positive cases. An efficient image retrieval algorithm is therefore presented, based on the color correlogram for color feature extraction, the wavelet transform for extracting shape features, and Gabor wavelets for texture feature extraction. Further, multiple features and different distance measures are combined to obtain image similarity using an SVM classifier. The results of this approach are encouraging with respect to color, shape and texture classification accuracy.
A global scene descriptor, Gist, has been considered [41] under the assumption that the selected features can implicitly encode high-level information about emotions, given their accuracy in various CBIR applications. The proposed methodology has been evaluated on two different databases.
A lot of research has been done on CBIR systems, but in all of them the input query must be a digital image with well-defined features such as shape, color and texture. Sometimes the user may not have a clear image for the query but may instead draw a sketch and search for similar images in the database. Such a need requires a new kind of image retrieval system, i.e. Sketch Based Image Retrieval (SBIR). A few researchers have brought this topic to light and carried out remarkable research in the field; some of their work is described below.
A large number of stable features can be used to transform image data [42] into scale-invariant coordinates. A set of reference features is stored in a database, and the query sketch/image features are compared with them to find matches via the Euclidean distance. The keypoint descriptors used here give a good probability of a correct match. A cascade filtering approach is used to detect keypoints. To distinguish false matches from correct ones, keypoint subsets are formed on the basis of location, scale and orientation. The scale space of an image is given by:
$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$   (29)
where $I$ is the input image and $G$ is the Gaussian kernel:
$G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}}\, e^{-(x^{2} + y^{2})/2\sigma^{2}}$
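The scale-space construction of Eq. (29) can be sketched in a few lines of Python, with scipy's gaussian_filter playing the role of the convolution with $G(x, y, \sigma)$. This is only an illustrative sketch (the sigma values and the difference-of-Gaussians step are assumptions), not the reference SIFT implementation.

import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(image, sigmas=(1.0, 1.6, 2.26, 3.2)):
    """Build L(x, y, sigma) = G(x, y, sigma) * I(x, y) for several scales (cf. Eq. (29))."""
    return [gaussian_filter(image.astype(float), sigma=s) for s in sigmas]

def difference_of_gaussians(levels):
    """Adjacent-scale differences, commonly used to localize keypoints."""
    return [b - a for a, b in zip(levels, levels[1:])]

img = np.random.rand(64, 64)          # stand-in for a grayscale image
dog = difference_of_gaussians(scale_space(img))
print(len(dog), dog[0].shape)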
The local self-similarity descriptor [43], which captures the internal geometric layout within images or videos, has been implemented. It handles color, edges and complex textures in a unified way; textures are compared on the basis of their spatial layout. It computes the sum of squared differences (SSD) and outputs the correlation surface:
$S_q(x, y) = \exp\!\left(-\frac{SSD_q(x, y)}{\max\!\left(var_{noise},\; var_{auto}(q)\right)}\right)$   (30)
where $var_{noise}$ denotes the acceptable variations in color and illumination, and $var_{auto}(q)$ is the maximal variance among all patches.
A general benchmark [44] for assessing the performance of any SBIR framework has been defined. For matching images, the SBIR framework uses the information contained in the outlined shape. The Spark feature was introduced for SBIR within a Bag of Visual Words (BoVW) system; Gradient Field Histogram of Oriented Gradients (GF-HOG) is an image descriptor suitable for SBIR. It treats a sketch's edges as a collection of points, as opposed to describing image patches, and describes lines and their connections to pick out images from the dataset. The normalized histograms were created using:
$H_i = \frac{h_i}{\sum_{k} h_k}$   (31)
Further, to compare two normalized histograms, the distance is calculated as:
$d = \sum_{i} \bigl| H^{A}_{i} - H^{B}_{i} \bigr|$   (32)
The patch-relationship-based Self-Similarity approach is much slower than this.
A framework that can dynamically generate shadows [45] extracted from a huge database of images has been proposed. This framework retrieves related images in real time from the incomplete sketches provided by the user. The gradients are normalized to remove variations and then binned into groups according to position, orientation and length. The normalized gradient $\hat{g}_i$ is computed as:
$\hat{g}_i = \frac{g_i}{\max\!\left(\bar{g}_i, \varepsilon\right)}$   (33)
where $g_i$ is the original gradient and $\bar{g}_i$ is the average gradient. Then, for binning, a new coordinate for each image pixel is defined as:
$\begin{pmatrix} x'_i \\ y'_i \end{pmatrix} = R\!\left(\theta_i\right) \begin{pmatrix} x_i \\ y_i \end{pmatrix}$   (34)
where $\mathcal{L}$ denotes the logistic loss, $d$ denotes the standard Euclidean distance, $\lambda$ is the controlling parameter, $W$ represents the projection function's parameters, and $b_1$ and $b_2$ are the margin biases for the left-hand and right-hand terms respectively.
Another way is to calculate the conditional probability [47] and use Bayes' theorem to merge the feature vectors. The conditional probability for determining the likelihood of two different feature vectors is:
$P(x, y) = P\!\left(y \in J \mid y \in A \cap x \in B\right)$   (36)
where $J$ is the set of feature vectors that are common to the local set $A$ and the global set $B$. Bayes' theorem can then be formulated as:
$P\!\left(J \mid x \cap y\right) = \frac{P\!\left(x \cap y \mid J\right) P(J)}{P\!\left(x \cap y \mid J\right) P(J) + P\!\left(x \cap y \mid L\right) P(L)}$   (37)
Here $L$ is the set of feature vectors that do not fall in $J$, either because they are only global or only local features.
The problem of fast, large-scale database searching [48] over more than one million images has driven the development of SBIR to ease the user's effort. Existing image retrieval frameworks do not allow the user to supervise the steps involved in the search. Two techniques are proposed, and experiments show that they significantly outperform existing approaches. The descriptors are constructed in such a way that the sketch and the color image undergo the same preprocessing. Images with similar structure are found by analyzing gradient orientations and directions. The histograms are constructed using Eq. (31), as for HOG. For the tensor descriptor, unlike HOG, the problem is posed as a maximization to find a single vector for each cell, defined as:
$\hat{x} = \arg\max_{\lVert x \rVert = 1} \sum_{(u,v) \in R_{i,j}} \left(x^{\top} g_{uv}\right)^{2}$   (38)
such that
$\sum_{(u,v) \in R_{i,j}} \left(x^{\top} g_{uv}\right)^{2} = \sum_{(u,v) \in R_{i,j}} x^{\top} g_{uv}\, g_{uv}^{\top}\, x = x^{\top}\!\left(\sum_{(u,v) \in R_{i,j}} g_{uv}\, g_{uv}^{\top}\right)\! x = x^{\top} T\, x$
where $T$ is the sum of the (outer) products of the gradients.
clustered based on dominant color distributions, to offset the lack of color-based
decision at the time of initial search.
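The per-cell maximization in Eq. (38) has a closed-form solution: the maximizer is the leading eigenvector of the structure tensor $T = \sum g\, g^{\top}$. The NumPy sketch below is a hypothetical rendering of that idea rather than the authors' code.

import numpy as np

def cell_orientation(gradients):
    """Single representative vector of a cell (cf. Eq. (38)).

    gradients: (n, 2) array of gradient vectors g_uv inside the cell.
    Returns the unit vector x maximizing sum (x . g_uv)^2, i.e. the
    leading eigenvector of the structure tensor T = sum g g^T.
    """
    T = gradients.T @ gradients                # 2x2 structure tensor
    eigvals, eigvecs = np.linalg.eigh(T)       # eigenvalues in ascending order
    return eigvecs[:, -1]                      # leading eigenvector

g = np.array([[1.0, 0.1], [0.9, -0.05], [1.1, 0.0]])
print(cell_orientation(g))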
A bag-of-regions approach [49] has been presented to construct an SBIR framework. This framework encodes the salient shapes at different levels of detail as closed contours of regions. A BoVW system is proposed that relies on GF-HOG, SIFT and the Structural Similarity Index (SSIM) descriptor to localize the sketched object inside the retrieved image. Frequency histograms $H^{Q}$ and $H^{I}$ are first constructed for the query sketch $Q$ and the database image $I$ respectively, and are then compared to rank the regions as:
$d\!\left(H^{Q}, H^{I}\right) = \sum_{i} \sum_{j} w_{ij}\; d\!\left(H^{Q}_{i}, H^{I}_{j}\right), \qquad w_{ij} = 1 - \bigl| v_i - v_j \bigr|$   (39)
where $H^{I}_{i}$ is the $i$-th bin of the histogram and $v_i$ is the normalized visual word corresponding to the $i$-th bin. Then the matching score $\Phi\!\left(R_k, R_s\right)$ between a region of the query sketch and the mapped region, with bounding boxes $B_k$ and $B_s$, is computed as:
$\Phi\!\left(R_k, R_s\right) = \frac{1}{N} \sum_{\{i,j\} \in P} \bigl\lVert B_{R_s}(p_i) - B_{R_k}(p_j) \bigr\rVert^{2}$   (40)
where $N$ is the cardinality of the correspondence set $P$.
The problem of fast, large-scale sketch-based image retrieval [50], searching a database of over one million images, was also addressed. According to this paper, current retrieval methods do not scale well to large databases in the context of interactively supervised search; two different approaches are proposed and objectively evaluated, showing that they significantly outperform existing approaches. The proposed descriptors are constructed such that both the full color image and the sketch undergo exactly the same preprocessing steps. First, a search is made for an image with similar structure by analyzing gradient orientations. Then, the best matching images are clustered based on dominant color distributions, to offset the lack of color-based decision during the initial search. Once images have been retrieved from the system, they need to be ranked on the basis of similarity measures. Kendall's rank correlation coefficient $\tau$ is computed as:
$\tau = \frac{n_c - n_d}{n(n-1)/2}$   (41)
where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs respectively.
In large databases, the same coefficient value may be generated for two different images, thus creating ties. In the case of multiple images with the same ranking score, the ranking coefficient defined in Eq. (41) is therefore modified as follows:
$\tau_b = \frac{n_c - n_d}{\sqrt{\left(n_0 - n_1\right)\left(n_0 - n_2\right)}}$   (42)
where
$n_0 = n(n-1)/2$ (number of possible pairs in a set of $n$ distinct elements),
$n_1 = \sum_{i} t_i\left(t_i - 1\right)/2$ (number of ties in the first list),
$n_2 = \sum_{j} u_j\left(u_j - 1\right)/2$ (number of ties in the second list).
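Eqs. (41) and (42) correspond to Kendall's $\tau$ and its tie-corrected variant $\tau_b$, which are readily available in scientific Python. The snippet below is only an illustrative sketch with made-up ranking lists; scipy's kendalltau accounts for ties in the two lists.

from scipy.stats import kendalltau

# Two rankings of the same images, e.g. system ranks vs. ground-truth ranks.
system_rank = [1, 2, 3, 4, 5, 6]
ground_truth = [2, 1, 3, 3, 5, 6]   # contains a tie

tau, p_value = kendalltau(system_rank, ground_truth)
print(f"Kendall tau_b = {tau:.3f} (p = {p_value:.3f})")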
An indexing structure [51] and a raw shape-based matching algorithm [52] were presented to compute similarities between natural images and sketch queries, and to make SBIR scalable to different types of images. The similarity from image $U$ to image $V$ is calculated as:
$\mathrm{Sim}_{U \rightarrow V} = \frac{1}{\lvert U \rvert} \sum_{p \in U} \mathbb{1}_{V}(p)$   (43)
where $p$ is an edge pixel and $\mathbb{1}_{V}(p)$ indicates whether $p$ has a matching edge pixel in $V$. Using Eq. (43), Structure-consistent Sketch Matching (SSM) [51] has been defined as:
$SSM\!\left(\mathcal{Q}, \mathcal{D}\right) = \left(\prod_{i} \mathrm{Sim}_{\mathcal{Q}_i \rightarrow \mathcal{D}}\right)^{\frac{1}{2N}} \cdot \left(\prod_{j} \mathrm{Sim}_{\mathcal{D}_j \rightarrow \mathcal{Q}}\right)^{\frac{1}{2N}}$   (44)
A visual word is represented by a triple (x, y, θ) of the position x and y and the angle θ. This enables a very efficient inverted index structure and makes it possible to build a real-time, large-scale sketch-based image search framework. It also tolerates local irregularities in the user's sketch query; a modified version of Eq. (44) using Oriented Chamfer Matching (OCM) was introduced as follows:
$OCM_{\mathcal{Q}, \mathcal{D}} = \frac{1}{2}\left(\mathrm{Sim}^{OCM}_{\mathcal{Q} \rightarrow \mathcal{D}} + \mathrm{Sim}^{OCM}_{\mathcal{D} \rightarrow \mathcal{Q}}\right)$   (45)
The researchers also compensate for the absence of position-invariant matching, as required in a large-scale search framework.
A sketch-based algorithm [53] for large-scale image retrieval was proposed, and a framework was created to answer queries over a database of millions of images. Two candidate areas are used for feature extraction: first, orientation features are extracted, and then these features are composed hierarchically to produce global-to-local features. The contour saliency map is:
$S = \sum_{j} s_{j}$   (46)
such that the constraint in Eq. (47) holds.
Accordingly, a hierarchical database index is organized, after which images can be retrieved from a large-scale image database on the web. This framework sifts through a substantial number of irrelevant images rapidly.
Recognizing similarities [54] between a hand-drawn sketch and the natural images in a database is an important process. A voting procedure is utilized to identify duplicate shape and structure patches. The Locality Sensitive Hashing procedure is appropriate for estimating the likeness between sets. The similarity between two sets has been defined as:
$sim\!\left(f_1, f_2\right) = \frac{\lvert F_1 \cap F_2 \rvert}{\lvert F_1 \cup F_2 \rvert} \in [0, 1]$   (48)
In this framework, the effect of noisy edges in the image degrades the retrieval performance.
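Eq. (48) is the Jaccard coefficient between the two patch sets, which locality-sensitive hashing approximates with MinHash signatures. The sketch below is a minimal, assumed illustration of both ideas, not the hashing scheme of [54]; the patch identifiers are placeholders.

import random

def jaccard(a, b):
    """Exact set similarity |A ∩ B| / |A ∪ B| (cf. Eq. (48))."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def minhash_signature(items, num_hashes=64, seed=0):
    """MinHash signature whose agreement rate estimates the Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

A = {"patch_12", "patch_47", "patch_88", "patch_90"}
B = {"patch_12", "patch_47", "patch_91"}
sig_a, sig_b = minhash_signature(A), minhash_signature(B)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(jaccard(A, B), estimate)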
A tensor-based image descriptor [55, 56] for large-scale SBIR was developed. It finds a single vector, using the tensor-based image descriptor, that is closest to the majority of parallel edge directions in a local neighborhood. Sousa et al. [56] used topological information to retrieve vector drawings, with geometric information used for matching shapes. The effect shown in the experimental results is small, and the geometric filtering still needs improvement.
An image retrieval system for the interactive search [57] of photo collections using free-hand sketches depicting shape has been developed. It describes Gradient Field HOG (GF-HOG), an adapted form of the HOG descriptor suitable for SBIR. GF-HOG is also incorporated into a Bag of Visual Words (BoVW) retrieval framework, and it is demonstrated how this combination may be harnessed both for robust SBIR and for localizing sketched objects within an image. The similarity between two sets of image tags is computed as:
$T\!\left(t^{1}, t^{2}\right) = \frac{\sum_{i=1}^{M} P\!\left(t^{1}_{i} \mid t^{2}\right) + \sum_{j=1}^{N} P\!\left(t^{2}_{j} \mid t^{1}\right)}{M + N}$   (49)
where $P\!\left(t^{1} \mid t^{2}\right)$ is computed as:
$P\!\left(t^{1} \mid t^{2}\right) = \frac{\lvert K_1 \cap K_2 \rvert}{\lvert K_1 \rvert\, \lvert K_2 \rvert}$
where $K$ is the set of keywords for the image tags, while $M$ and $N$ are the total numbers of tag elements in the sets $t^{1}$ and $t^{2}$ respectively.
Evaluation has been performed over a large Flickr-sourced dataset comprising 33 shape categories, using queries from 10 non-expert sketchers. GF-HOG is compared against state-of-the-art descriptors with common distance measures and language models for image retrieval, and the authors explore how affine deformation of the sketch impacts search performance.
A bag-of-features model [58] can be utilized to develop a feature set for sketches, and a large dataset of sketches was gathered to assess human recognition. Certain restrictions of this framework matter for specific components that may depend on the context of the sketch and are highly predictable only for specific sorts of sketches, since they contain transient information about the sketch. Hand-drawn sketch features based on stroke components are presented to preserve basic structural information, and the visual-word codebook is maintained in a hierarchical tree. Corners are detected using:
$E(x, y) = \sum_{u,v} w(u, v)\,\bigl[I(x + u, y + v) - I(x, y)\bigr]^{2}$   (50)
where $w(u, v)$ is the Gaussian kernel. Later, the similarity between sketches is computed using the $\chi^{2}$ kernel as:
were computed using \ kernel as:
మ
I
ೝ
J
݄- ǡ ݄2 ൌ ∑ (51)
ೝ
where $h_i$ and $h_j$ are the descriptors of a given sketch pair. The retrieval ranking is then obtained by sorting the entries of the distance matrix $\hat{D}$.
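The $\chi^{2}$ comparison of Eq. (51) can be written in a few lines of NumPy. The snippet below is an illustrative sketch under the usual convention that a small epsilon avoids division by zero for empty bins; it is not the authors' implementation, and the example descriptors are made up.

import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histogram descriptors (cf. Eq. (51))."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

sketch_a = np.array([4, 0, 2, 7, 1], float)
sketch_b = np.array([3, 1, 2, 5, 2], float)
print(chi2_distance(sketch_a, sketch_b))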
A strategy to reduce the appearance gap [59] in SBIR was proposed. In this paper, sketches and extracted edges are treated as sets of line segments, laying the groundwork for better edge description and for reducing the effect of noise from another angle. An object boundary selection algorithm is utilized to lessen the effect of noisy edges, and spatial and coherence constraints are proposed to filter out the false matches that degrade the retrieval throughput.
Using convolutional neural networks [60], 3-D models can be retrieved from 2-D human sketches. Since the majority of models are upright, the number of views per object in the dataset has been reduced to two. The model is based on Siamese Convolutional Neural Networks to capture different properties in the two domains. An 'align' function has also been defined to compare features of the query sketch and the 3-D models; it is defined as:
$L\!\left(s_1, s_2, y\right) = (1 - y)\,\alpha\, D_{w}^{2} + y\,\beta\, e^{-\gamma D_{w}}$   (52)
where $s_1$ and $s_2$ are the two samples, $y$ is the binary similarity label, $D_w$ is the distance between the two samples, and $\alpha$, $\beta$ and $\gamma$ are weighting constants.
The cross domain matching is successful if features from each domain are 'aligned'
correctly.
A step of sketch generation on real images is used to obtain a better edge map [61] for each image. A novel multi-label graph-cut algorithm is used to group edges according to two key Gestalt principles, i.e., continuity and proximity. The perceptual edge grouping framework produces better edge groups, which can be used to improve a variety of higher-level tasks. The edge grouping should minimize the overall energy function defined by:
$E(G) = \sum_{e_i \in \mathcal{V}} \phi\!\left(e_i, G_{e_i}\right) + \sum_{(e_i, e_j) \in \mathcal{E}} \psi\!\left(e_i, e_j\right)$   (53)
where $G$ is a set of edge groups, $\phi(e_i, G_{e_i})$ is the measure of fitness between edge $e_i$ and its assigned group $G_{e_i}$, computed from the proximity and continuity of the edge with respect to the group, and $\psi(e_i, e_j)$ is the pairwise term between neighbouring edges.
Sketch-based image retrieval (SBIR) is a challenging task due to the ambiguity
inherent in sketches when compared with photos. A novel convolutional neural
network [62] based on the Siamese architecture has been implemented for SBIR. The main
idea is to pull output feature vectors closer for input sketch-image pairs that are labeled
as similar, and push them away if irrelevant. Therefore, the similarity metric is defined
as:
$D_{\theta}(S, I) = \bigl\lVert f_{\theta}(S) - f_{\theta}(I) \bigr\rVert$   (54)
where $S$ is the sketch and $I$ is the database image edge map. A loss function is then defined on the parameters $\theta$ as:
$L(\theta) = \sum_{i} \ell\!\left(\theta, S_i, I_i, y_i\right)$   (55)
$\ell\!\left(\theta, S, I, y\right) = (1 - y)\, L_{P}\!\bigl(D_{\theta}(S, I)\bigr) + y\, L_{N}\!\bigl(D_{\theta}(S, I)\bigr)$
where $(S, I, y)$ is a training sample, $L_P$ is the loss function for a positive pair and $L_N$ is the loss function for a negative pair. This is achieved by jointly tuning two convolutional neural networks, which are linked by one loss function.
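The joint training in [62] hinges on a contrastive objective of the kind in Eqs. (54)-(55). The fragment below is a hypothetical PyTorch-style sketch of such a loss, using a squared-distance term for similar pairs and a hinge term for dissimilar ones; the margin value and the label convention (y = 1 for a similar pair) are assumptions and are not taken from the paper.

import torch

def contrastive_loss(feat_sketch, feat_image, y, margin=1.0):
    """Pull similar sketch/image pairs together, push dissimilar ones apart.

    feat_sketch, feat_image: (batch, dim) outputs of the two CNN branches.
    y: (batch,) labels, 1 for a similar pair, 0 for a dissimilar pair (assumed convention).
    """
    d = torch.norm(feat_sketch - feat_image, dim=1)             # distance as in Eq. (54)
    loss_pos = y * d.pow(2)                                      # similar pairs: small distance
    loss_neg = (1 - y) * torch.clamp(margin - d, min=0).pow(2)   # dissimilar: at least `margin` apart
    return (loss_pos + loss_neg).mean()

s = torch.randn(8, 64)
i = torch.randn(8, 64)
y = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(s, i, y).item())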
Table 2. Comparison results for various SBIR techniques
Methods Vocabulary Size MAP
The results in the above table show the Mean Average Precision (MAP) obtained for the different techniques. SBIR has mostly been evaluated on the Flickr dataset of sketch-image pairs, which consists of 33 categories of sketches. The MAP is calculated over the whole database by considering each sketch as the query sketch.
4. Conclusion
References
[1] N. Vázquez, M. Nakano & H. P. Meana, "Automatic System for Localization and Recognition of
Vehicle Plate Numbers", Journal of Applied Research and Technology, pp. 63-77, 2002.
[2] L. F. Pulido, O. Startostenko, D. F. Quéchol, J. I. R. Flores, I. Kirchning & J. A. C. Aragón, "Content-
Based Image Retrieval Using Wavelets", 2nd WSEAS International Conference on Computer
Engineering and Applications, pp. 40-45, 2008.
[3] S. Wang & H. Qin, "A Study of Order-based Block Color Feature Image Retrieval Compared with
Cumulative Color Histogram Method", 6th International Conference on Fuzzy Systems and Knowledge
Discovery, Vol. 1, pp. 81-84, 2009.
[4] H. Tamura, S. Mori & T. Yamawaki, "Textural Features Corresponding to Visual Perception", IEEE
Transactions on Systems, Man and Cybernetics, Vol. 8, No. 6, pp. 460-473, 1978.
[5] S. K. Saha, A. K. Das & B. Chanda, "CBIR Using Perception Based Texture And Colour measures",
Proceedings of the 17th International Conference on Pattern Recognition, Vol. 2, pp. 985-988, 2004.
[6] J. Canny, "A computational approach to edge detection", IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol. PAMI-8, No. 6, pp. 679-698, 1986.
[7] S.Nandagopalan, Dr. B.S. Adiga, & N. Deepak, "A Universal Model for Content-Based Image
Retrieval", International Journal of Computer, Electrical, Automation, Control and Information
Engineering, Vol. 2, No. 10, pp. 3436-3439, 2008.
[8] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafher, D. Lee, D.
Petkovic, D. Steele & P. Yanker, "Query By Image and Video Content: The QBIC System", Computer,
Vol. 28, No. 9, 1995.
[9] P. Charlton, B. Huet & G. Justog, "Modelling Agents in C++CL for Content-Based Image Retrieval",
Proceedings of the 4th Euromicro Workshop on Parallel and Distributed Processing, pp. 59-66, 1996.
[10] B. Verma, P. Sharma, S. Kulkarni & H. Selvaraj, "An Intelligent On-line System for Content Based
Image Retrieval", 3rd International Conference on Computational Intelligence and Multimedia
Applications, pp. 273-277, 1999.
[11] H. K. Lee & S. I. Yoo., "A neural network-based image retrieval using nonlinear combination of
heterogeneous features", Proceedings of the 2000 Congress on Evolutionary Computation, Vol. 1, pp.
667-674, 2000.
[12] J. H. Lim, W. J. Kang, S. Singh & D. Narasimhalu, "Learning Similarity Matching in Multimedia
Content-Based Retrieval", IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 5, pp.
846-850, 2001.
[13] X. L. Li, "Content-based image retrieval system with new low-level features, new similarity metric,
and novel feedback learning", International Conference on Machine Learning and Cybernetics, Vol. 2,
pp. 1126-1132, 2002.
[14] P. M. Willy & K. H. Kufer. “Content-based Medical Image Retrieval (CBMIR): An Intelligent
Retrieval System for Handling Multiple Organs of Interest”, 17th IEEE Symposium on Computer-
Based Medical Systems, pp. 103-108, 2004.
[15] B. Li & S. Yuan, "A novel relevance feedback method in content-based image retrieval", International
Conference on Information Technology: Coding and Computing, Vol. 2, pp. 120-123, 2004.
[16] G. Mori, S. Belongie & J. Malik, "Efficient shape matching using shape contexts", IEEE Transactions
on Pattern Analysis and Machine Intelligence, Vol. 27, No. 11, pp. 1832-1837, 2005.
[17] R. Vermilyer, "Intelligent User Interface Agents in Content-Based Image Retrieval", Proceedings of
the IEEE SoutheastCon., pp. 136-142, 2006.
[18] S. Wu, M. K. M. Rahman & T. W. S. Chow, "Content-based image retrieval using growing
hierarchical self-organizing quad tree map", Pattern Recognition, Vol. 38, No. 5, pp. 707-722, 2005.
[19] C. Theoharatos, N. Laskaris, G. Economou, S. Fotopoulos, "Combining self-organizing neural nets
with multivariate statistics for efficient color image retrieval", Computer Vision & Image
Understanding, Vol. 102, No. 3, pp. 250-258, 2006.
[20] S. C. Chuang, Y. Y. Xu, H. C. Fu & H. C. Huang, "Multiple-Instance Neural Networks based Image
Content Retrieval System", 1st International Conference on Innovative Computing, Information and
Control, Vol. 2, pp. 412-415, 2006.
[21] A. J. M. Traina, J. Marques & C. Traina, "Fighting the Semantic Gap on CBIR Systems through New
Relevance Feedback Techniques", 19th IEEE International Symposium on Computer-Based Medical
Systems, pp. 881-886, 2006.
[22] X. X. Xie, Y. Zhao & Z. F. Zhu, "An Anamnestic Semantic Tree-Based Relevance Feedback Method
in CBIR System", 1st International Conference on Innovative Computing, Information and Control, Vol.
3, pp. 91-94, 2006.
[23] M. L. Kherfi & D. Ziou, "Relevance feedback for CBIR: a new approach based on probabilistic feature
weighting with positive and negative examples", IEEE Transactions on Image Processing, Vol. 15, No.
4, pp. 1017-1030, 2006.
[24] S. C. H. Hoi, M. R. Lyu, & R. Jin, "A unified log-based relevance feedback scheme for image
retrieval", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 4, pp. 509-524, 2006.
[25] P. S. Karthik & C. V. Jawahar, "Analysis of Relevance Feedback in Content Based Image Retrieval",
9th International Conference on Control, Automation, Robotics and Vision, pp. 1-6, 2006.
[26] B. Wang, X. Zhang & N. Li, "Relevance Feedback Technique for Content-Based Image Retrieval
using Neural Network Learning", International Conference on Machine Learning and Cybernetics, pp.
3692-396, 2006.
[27] G. Das & S. Ray, "A Comparison of Relevance Feedback Strategies in CBIR", 6th IEEE/ACIS
International Conference on Computer and Information Science, pp. 100-105, 2007.
[28] Q. Li, S. Luo & Z. Shi, "Semantics-Based Art Image Retrieval Using Linguistic Variable", 4th
International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 2, pp. 406-410, 2007.
[29] Y. Zhu, X. Liu & W. Mio, "Content-Based Image Categorization and Retrieval using Neural
Networks", IEEE International Conference on Multimedia and Expo, pp. 528-531, 2007.
[30] X. Wu & D. Fu, "Apply hybrid method of relevance feedback and EMD algorithm in a color feature
extraction CBIR system", International Conference on Audio, Language and Image Processing , pp.
163-166, 2008.
[31] T. Zhu, W. W. Y. Ng, J. W. T. Lee, B. B. Sun, J. Wang & D. S. Yeung, "L-gem based co-training for
CBIR with relevance feedback", International Conference on Wavelet Analysis and Pattern Recognition,
Vol. 2, pp. 873-879, 2008.
[32] H. S. Cruz & E. Bribiesca, "Polygonal Approximation of Contour Shapes using Corner Detectors",
Journal of Applied Research and Technology, Vol. 7, No. 3, pp. 275-291, 2009.
[33] S. Sadek, A. A. Hamadi, B. Michaelis & U. Sayed, "Cubic-splines neural network based system for
image retrieval", 16th IEEE International Conference on Image Processing, pp. 273-276, 2009.
[34] S. Sadek, A. A. Hamadi, B. Michaelis & U. Sayed, "Cubic-splines neural network based system for
image retrieval", 16th IEEE International Conference on Image Processing, pp. 273-276, 2009.
[35] M. R. A. Sadjadi, J. Salazar & Srinivasan, "An Adaptable Image Retrieval System With Relevance
Feedback Using Kernel Machines and Selective Sampling", IEEE Transactions on Image Processing,
Vol. 18, No. 7, pp. 1645-1659, 2009.
[36] W. Liu & W. Li, "A Novel Semi-Supervised Learning for Collaborative Image Retrieval",
International Conference on Computational Intelligence and Software Engineering (CISE), pp. 1-4,
2009.
[37] P. A. de Alarcón, A. D. P. Montano & J. M. Carazo, "Spin Images and Neural Networks for Efficient
Content-Based Retrieval in 3D Object Databases", International Conference on Image and Video
Retrieval, pp. 225-234, 2002.
[38] N. V. Nguyen, A. Boucher, J. M. Ogier & S. Tabbone, "Clusters-Based Relevance Feedback for CBIR:
A Combination of Query Movement and Query Expansion", IEEE RIVF International Conference on
Computing and Communication Technologies, Research, Innovation, and Vision for the Future, pp. 1-6,
2010.
[39] Z. Shoaie & S. Jini, "Semantic image retrieval using relevance feedback and reinforcement learning
algorithm", 5th International Symposium on I/V Communications and Mobile Network, pp. 1-4, 2010.
[40] K. Sugamya, S. Pabboju & A. V. Babu, "A CBIR Classification Using Support Vector Machines",
International Conference on Advances in Human Machine Interaction, pp. 1-6, 2016.
[41] S. Gbèhounou, F. Lecellier & C. F. Maloigne, "Evaluation of local and global descriptors for emotional
impact recognition", Journal of Visual Communication and Image Representation, Vol. 38, pp. 276-283,
2016.
[42] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of
Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.
[43] E. Shechtman & M. Irani, "Matching Local Self-Similarities across Images and Videos", IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[44] M. Eitz, K. Hildebrand, T. Boubekeur & M. Alexa, "A descriptor for large scale image retrieval based
on sketched feature lines", Proceedings of the 6th Eurographics Symposium on Sketch-Based Interfaces
and Modeling, pp. 29-36, 2009.
[45] C. L. Zitnick, "Binary coherent edge descriptors", Proceedings of the 11th European Conference on
Computer Vision: Part II, pp. 170-182, 2010.
[46] J. Philbin, M. Isard, J. Sivic & A. Zisserman, "Descriptor learning for efficient retrieval", Proceedings
of the 11th European Conference on Computer Vision conference on computer vision: Part III, pp. 677-
691, 2010.
[47] L. Zheng, S. Wang, W. Zhou & Q. Tian, "Bayes merging of multiple vocabularies for scalable image
retrieval", IEEE Conference on Computer Vision and Pattern Recognition, pp. 1963-1970, 2014.
[48] M. Eitz, K. Hildebrand, T. Boubekeur & M. Alexa, "An evaluation of descriptors for large-scale image
retrieval from sketched feature lines", Computers and Graphics, Vol. 34, No. 5, pp. 482-498, 2010.
[49] R. Hu, T. Wang & J. Collomosse, "A bag-of-regions approach to sketch-based image retrieval", 18th
IEEE International Conference on Image Processing, pp. 3661-3664, 2011.
[50] M. Eitz, K. Hildebrand, T. Boubekeur & M. Alexa, "Sketch Based Image Retrieval: Benchmark and
Bag-Of-Features Descriptors", IEEE Transactions on Visualization and Computer Graphics, Vol. 17,
No. 11, pp. 1624-1636, 2011.
[51] Y. Cao, C. Wang, L. Zhang & L. Zhang, "Edgel index for large-scale sketch-based image search",
IEEE Conference on Computer Vision and Pattern Recognition, pp. 761-768, 2011.
[52] X. Sun, C. Wang, C. Xu & L. Zhang, "Indexing billions of images for sketch-based retrieval",
Proceedings of the 21st ACM International Conference on Multimedia, pp. 233-242, 2013.
[53] R. Zhou, L. Chen & L. Zhang, "Sketch-based image retrieval on a large scale database", Proceedings
of the 20th ACM International Conference on Multimedia, pp. 973-976, 2012.
[54] K. Bozas & E. Izquierdo, "Large scale sketch based image retrieval using patch hashing", Advances in
Visual Computing, pp. 210-219, 2012.
[55] M. Eitz, J. Hays & M. Alexa, "How do humans sketch objects", ACM Transactions on Graphics, Vol.
31, No. 4, 2012.
[56] P. Sousa & M. J. Fonseca, "Sketch-based retrieval of drawings using spatial proximity", Journal of
Visual Languages and Computing, Vol. 21, No. 2, pp. 69-80, 2010.
[57] R. Hua & J Collomosse, "A Performance Evaluation of Gradient Field HOG Descriptor for Sketch
Based Image Retrieval", Journal of Computer Vision and Image Understanding, Vol. 117, No. 7, pp.
790-806, 2013.
[58] C. Ma, X. Yang, C. Zhang, X. Ruan & M.-H. Yang, "Sketch retrieval via dense stroke features", Image
and Vision Computing, Vol. 46, pp. 64-73, 2016.
[59] S. Wang, J. Zhang, T. X. Han & Z. Miao, "Sketch-Based Image Retrieval Through Hypothesis-Driven
Object Boundary Selection With HLR Descriptor", IEEE Transactions on Multimedia, Vol. 17, No. 7,
pp. 1045-1057, 2015.
[60] F. Wang, L. Kang & Y. Li, "Sketch-based 3d shape retrieval using convolutional neural networks",
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875-1883, 2015.
[61] Y. Qi, Y. Z. Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li & J. Guo, "Making better use of edges via
perceptual grouping", IEEE Conference on Computer Vision and Pattern Recognition, pp. 1856-1865,
2015.
[62] Y. Qi, Y. Z. Song, H. Zhang & J. Liu, "Sketch-Based Image Retrieval via Siamese Convolutional
Neural Network", IEEE International Conference on Image Processing, pp. 2460-2464, 2016.
[63] K. Porkaew, M. Ortega & S. Mehrota, "Query reformulation for content based multimedia retrieval in
MARS", Vol. 2, pp. 747-751, 1999.
[64] G. Das, S. Ray & C. Wilson, "Feature re-weighting in content-based image retrieval", International
Conference on Image and Video Retrieval, pp. 193-200, 2006.
[65] N. Dalal & B. Triggs, "Histograms of oriented gradients for human detection", IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 886-893, 2005.
[66] E. Shechtman & M. Irani, "Matching local self similarities across images and videos", IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007.
Advanced Stevia Disease Detection Using Deep Learning
S. Lakshmi and R. Sivakumar
Abstract. Almost all of us are tempted to eat sweets because of their pleasant taste. When overused, they affect the entire body. Diabetes is a disease that occurs when the blood glucose level is high. According to studies of the World Health Organization (WHO), the prevalence of diabetes has doubled in the last 10 years. Lifestyle, working environment, nature of work, food habits and heredity are a few causes of diabetes. Over time, diabetes leads to various health problems such as heart disease, stroke, kidney problems, nerve damage, and eye and dental problems. Stevia is a sugar substitute which is available all over the world and has been shown to be safer for diabetic patients. Stevia contains proteins, vitamins and minerals. The stevia plant may be affected by various diseases such as root rot, charcoal rot, wilt, leaf spot disease and so on. This chapter demonstrates a deep learning approach that enables disease detection through image recognition. A deep convolutional neural network is trained to classify the disease-affected leaves, achieving an accuracy of over 99%.
1. Introduction
Stevia is extracted from the herb Stevia rebaudiana, which is widely used as a sugar substitute for diabetics. More than thirty million people have been diagnosed with diabetes in India. In rural areas, the prevalence is approximately three percent of the total population. Impaired Glucose Tolerance (IGT) is also a big health issue in India. The design of a software-integrated tool using deep learning and image processing can greatly improve stevia production and maintain the quality of the product. Deep learning comprises artificial neural networks and machine learning algorithms. Models can be created for representing large-scale data in image processing applications using deep learning algorithms. Generally, deep learning algorithms show good performance in recognition and mining applications.
CNNs have been successfully used in several domains for mapping an input vector (such as a picture of a diseased plant) to an output, e.g. a crop disease feature vector. The nodes in a neural network implement mathematical functions that take numerical inputs from the incoming edges and deliver numerical outputs. Deep neural networks map the input layer to the output layer through a series of stacked (or hidden) layers of nodes. The main challenge is to design the structure of the network as well as the functions (nodes) and edge weights that correctly map the input to the output. CNNs can be trained by tuning the network parameters so that the mapping improves during the training process, which is computationally demanding.
This chapter examines CNNs applied to plant disease analysis and it is structured
as follows. Section 2 portrays the problem. Section 3 discusses the application of
computational intelligence in agricultural pest control. Section 4 presents the proposed
methodology in detail. Section 5 describes the results and discussions and Section 6
provides the intended future work. Finally, Section 7 gives the concluding remarks.
2. Problem Characterization
Stevia is a gift of God to diabetic patients; its botanical name is Stevia rebaudiana Bertoni. It may also be called "sweet leaf", "sugar leaf" or "honey leaf". It belongs to the Asteraceae (sunflower) family and contains a glycoside in its leaves. The carbohydrate and fat content of stevia is very low. It grows in the warmest areas of the United States, Paraguay and Brazil. It was originally used in Paraguay to sweeten tea. Recently it has been cultivated in many Indian states such as Rajasthan, Maharashtra, Kerala and Orissa.
Among the 150 species of stevia, S. rebaudiana is the sweetest and has been described as a 'wonder plant' or the sweetener of the future owing to its calorie-free natural sweetness. It has been reported to have various nutritional and medicinal properties, such as antimicrobial, antiviral, antifungal, antihypertensive, anti-hyperglycaemic and antitumor effects [5][8]. The stevia leaves are shown in Figure 1. The leaves of a stevia plant contain
• gulcoside 0.3%
• rebaudioside C 0.6%
• rebaudioside A 3.8% and
• Stevioside 9.1%.
Stevia contains proteins, vitamins and minerals. It is also used in folk medicine to treat hypertension and diabetes. Generally, doctors suggest a higher fiber and protein intake for diabetic patients. Stevia is also used to prevent tooth decay and helps to rejuvenate the pancreatic gland.
Farmers in India have turned their attention to stevia for the following reasons: stevia is used as a sweet substitute in food products such as seafood, soft drinks and candies in Japan and Korea. Stevia was initially banned by the US government, but was later approved as a food additive in 2008.
The dried white powder extracted from stevia leaves is sweeter than the other forms of stevia, which are available as teabags in supermarkets. The sweetness of stevia powder depends on the refinement process. The stevia plant may be affected by various diseases such as root rot, charcoal rot, wilt, leaf spot disease and so on, which can cause significant economic, social and environmental loss globally. It is very difficult to monitor the crop continuously and detect disease at the earliest stage.
The growing field is very important for any plant. The basic requirements for stevia cultivation are listed below:
1. Bed size - 120 cm
2. Distance between the plants - 30 cm
3. Height - 25 cm
4. Width - 50 to 60 cm
5. Rainfall - 150 cm
6. Temperature - 30°C to 32°C
7. pH range - between 6.5 and 7.5
8. Soil - red and sandy loam soil
The plant is propagated by seeds, tissue culture or stem cuttings. Seed-based cultivation is not advisable since it takes forty to sixty days before transplanting to the field, and tissue culture is expensive. Stem cutting is therefore the cheaper and easier option and takes less time compared with the other techniques. A sample rectangular greenhouse of stevia plants is shown in Figure 2.
Fertilization:
Generally, fertilization plays a major role in raising agricultural production. Organic and inorganic fertilizers are used for stevia plant growth. Processed chicken manure is used as a basal and side-dressing fertilizer.
Weed Control:
The proper procedure for planting stevia is very important for controlling the growth of weeds in the field. Manual weeding is used to control the weeds, but it is costly and time consuming. Side-dressing fertilizers are used to boost the growth of stevia and also to control the weeds in the field. Biodegradable mulches such as leaves and grass clippings help to control weed growth.
Water Management:
The water supply should be consistent while growing stevia. It is advisable to use a sprinkler irrigation system for producing stevia [20].
Harvesting:
Harvesting can be done when the flower buds start to grow, at which point the vegetative growth stops. The interval between harvests is normally 30 to 35 days, and the yield is 20 to 50 grams per plant [11] [20]. After harvesting, the plant should be dried, and drying should be done properly: any delay in the drying process will affect the quality of the stevia. The stevioside content is reduced by 33% when the drying process is delayed by three days or more. The moisture content should be less than 10% for storing stevia. Normally stevia plants produce a good yield for up to five years, after which they are replaced with new ones. According to the financial giant Rabobank, stevia sales are predicted to reach about 700 million dollars in the next few years.
Stevia plants are affected by stem rot diseases. Since stevia acts as a sugar substitute and animals like sweet plants, the pests cannot be fully controlled in the field. Whiteflies, aphids and thrips are also problems in greenhouses. There are several pathologies, such as fungi, bacteria and viruses, that affect stevia plants in different parts such as the roots, seedlings, stems, leaves and fruits [13] [14]. Some of them are described here in detail.
i. Alternaria alternata
It is a fungus which causes leaf spot. It can enter through the stomata or penetrate directly from the top of the leaf using the appressorium, as shown in Figure 3. It is very common in medicinal plants cultivated in various districts of West Bengal, India. Symptoms of leaf spot disease initially appear as small brown circular spots, which may then develop into dark brown irregular shapes. Infections spread from small circles to large necrotic areas. These concentric dark brown spots are commonly found at the tips of the leaves.
Figure 3. Alternaria alternata
(ii) White mould
The stevia plants can also be affected by white mould disease, caused by a pathogenic fungus, which is shown in Figure 4. It is also called cottony rot or watery soft rot. It leads to heavy losses when it appears in the field, so it is always advisable to detect this disease early so that the necessary steps can be taken to avoid the losses.
(iii)Sclerotium rolfsii
(iv)Septoria steviae
Mostly this disease affects the plant in warm weather and turns the leaves light brown. When stevia is affected by this disease, light brown spots appear first and then slowly extend into yellow haloes on the outer leaves, as shown in Figure 6. The middle portion of the leaf then turns gray, and as the severity rises the leaves may die. The disease may be transmitted through seeds.
(v) Rhizoctonia sp.
Sunken reddish spots gradually expand and kill the plants, as shown in Figure 7. Plants are affected at an early stage and their colour changes to reddish brown, which prevents their regular growth.
Figure 7. Rhizoctonia sp.
It is a small insect with a lean mouth through which it sucks out the fluids of the plant, as shown in Figure 8. It is very hard to differentiate from other insects, so it is always better to detect it as early as possible to reduce the losses.
(vii) Whitefly
Mostly, this is found on the underside of the leaves; it damages the plants by sucking their fluids and turning the leaves yellow. It reduces the yield by weakening the plants and creates an environment in which diseases spread easily. The whitefly is shown in Figure 9.
Figure 9. Whitefly
In India, the main part of our income depends on agriculture. Unexpected diseases reduce the growth of plants, which leads to heavy losses. Recent developments in technology can act as an aid to identify diseases earlier, so that losses can be reduced to some extent and production raised. Hence, an automated tool for early disease detection is really useful to society for raising profits. With an automated tool it is easy to monitor the entire field; machine vision can then provide automatic inspection, which is more accurate than the visual inspection of fields.
The unwanted noisy details are removed using various filtering techniques, and the images are then enhanced to preserve their fine details. Image segmentation techniques are applied to extract the boundaries of the leaves. Finally, classification techniques are used to identify and recognize the plant diseases. Various algorithms are used, such as neural networks, support vector machines and rule-based classification. Color is used in [1] to identify the following diseases:
1. Cottony mold
2. Late scorch
3. Scorch
4. Ashen mold and
5. Tiny whiteness
The texture details are calculated using the co-occurrence matrix and are fed into a neural network to detect the diseases.
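The gray-level co-occurrence matrix mentioned above can be computed directly. The sketch below is a minimal, assumed NumPy version for a single displacement (one pixel to the right) and derives two common texture measures (contrast and energy); it is only meant to illustrate the idea, not to reproduce the features used in [1], and the input is a random stand-in image.

import numpy as np

def glcm(gray, levels=8, dx=1, dy=0):
    """Co-occurrence counts of gray-level pairs at displacement (dx, dy)."""
    q = (gray.astype(float) / 256 * levels).astype(int)   # quantize to `levels` bins
    h, w = q.shape
    M = np.zeros((levels, levels))
    for y in range(h - dy):
        for x in range(w - dx):
            M[q[y, x], q[y + dy, x + dx]] += 1
    return M / M.sum()                                     # normalize to probabilities

def texture_features(P):
    i, j = np.indices(P.shape)
    contrast = np.sum(P * (i - j) ** 2)
    energy = np.sum(P ** 2)
    return contrast, energy

img = np.random.randint(0, 256, (32, 32))                  # stand-in for a leaf image
print(texture_features(glcm(img)))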
Mohan et al. [9] introduced a new system to classify different diseases from paddy leaf images using Scale Invariant Feature Transform (SIFT) features, which are then classified using KNN and SVM.
In general, diseases can be identified easily by experts with the naked eye, but farmers are unaware of certain diseases and consulting experts is too expensive and time consuming [1]. As an example, the grapefruit peel may be infected by several diseases [3], which can be identified using the squared distance technique [1]. Leaf diseases can also be classified automatically using high-resolution multispectral and stereo images [17].
Image segmentation techniques can be used to extract the disease-affected part of the plants; an ideal threshold value can be calculated using a Parzen window for segmenting the images [21]. Hence, in this work we use a convolutional neural network (CNN) for disease detection.
4. Proposed Method
We present a deep learning approach to identify disease-affected stevia leaves. Figure 11 depicts the architecture of the proposed work. It consists of two important steps:
Using a web camera, stevia plant images are captured from the field regularly. The acquired images are stored for analyzing the growth details. Each image is represented using RGB combinations. The size of the images is altered to make further processing easier; this can be done in MATLAB using the imresize() function. Each image is resized and stored in a database for testing.
The acquired images are brought into the MATLAB environment with the imread() function, whose syntax is
I = imread('filename');
where the variable I stores the image file values in matrix format.
The image set is generated by capturing the stevia field directly. It may contain images of healthy as well as unhealthy leaves. When dealing with real-world images, some percentage of noisy, unwanted information is inevitably added at the time of capture. It is necessary to remove this unwanted information before processing the image so that accurate results can be obtained. The literature offers many filtering techniques, such as the mean filter, median filter, adaptive mean filter, weighted mean filter, Wiener filter and so on. We applied various filters, analyzed the results, and chose the median filter for removing the noise from the stevia plant images. The syntax for the median filter is
J = medfilt2(I);
where I is the noisy image and J is the filtered, i.e. denoised, image.
Even though color plays a major role in disease identification, we have also applied our CNN method to grayscale images to evaluate the results. The color images are converted to grayscale using the function rgb2gray(). The diseases are then identified and classified with the help of a convolutional neural network.
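The same acquisition and preprocessing steps (read, resize, median filtering, grayscale conversion) can be mirrored outside MATLAB. The following OpenCV fragment is only an assumed Python equivalent of the pipeline described above; the file name and target size are placeholders.

import cv2

def preprocess(path, size=(256, 256)):
    """Read a stevia leaf image, resize it, remove noise and convert to grayscale."""
    img = cv2.imread(path)                        # color image, as with imread in MATLAB
    img = cv2.resize(img, size)                   # counterpart of imresize()
    img = cv2.medianBlur(img, 3)                  # 3x3 median filter, counterpart of medfilt2()
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # counterpart of rgb2gray()
    return img, gray

# color, gray = preprocess("stevia_leaf.jpg")     # placeholder file name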
Convolutional neural networks are inspired by biological processes, and their uses are by no means limited. Some of the application areas of CNNs are listed below:
• Image processing
• Video processing
• Pattern recognition
• Recommender systems and
• Natural language processing
Artificial Neural Networks (ANNs) have proved very successful in various machine learning settings. Recent developments in neural network based deep learning architectures have demonstrated promising results in solving image classification problems efficiently.
LeCun proposed the CNN, which combines a neural network with three key concepts and designs in which connections between neurons of adjacent layers are enforced [15] [16] [22]. The CNN architecture consists of the following layers:
1. Convolutional layer
2. Sub-sampling layer
3. Output layer
Convolutional layer:
It is a basic layer in the CNN architecture. Here, neurons are arranged in a two-
dimensional array. A feature map is extracted from this 2D array. The output of a
plane is called a feature map. A connection is based on the convolution mask
which is a matrix of adjustable entries. The bias term is a scalar which is added to the convolutional output. Finally, an activation function is applied to the result to
obtain the plane’s output.
Convolution map:
The main idea of the convolution map is to extract features from the given denoised images. It has a set of learnable filters. Each filter is applied to the raw pixel values of the image, taking into account the red, green and blue color channels in a sliding-window fashion and computing the dot product between the filter weights and the input pixels. Applying this sliding-window procedure yields the feature map, which is the activation map of the filter. Features of the input images, such as edges and curves, are captured through the learning process: during training, the CNN itself learns the values of the filters.
The convolving operation can be done by using the following equation:
M_i = b_i + \sum_j W_{ij} * X_j    (1)
Here,
M_i – the i-th feature map
* – convolution operator
X_j – the j-th input channel
W_{ij} – sub-kernel of the channel
b_i – bias
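For illustration only, Eq. (1) for a single input channel and a single feature map could be sketched in MATLAB as below; the 5 x 5 filter W, the bias b, the file name and the sigmoid activation are placeholders standing in for the learned values:
X = double(rgb2gray(imread('leaf01.jpg')));   % one input channel (placeholder file name)
W = randn(5, 5);                              % placeholder filter weights (learned during training)
b = 0.1;                                      % placeholder bias term
M = b + conv2(X, W, 'valid');                 % convolution output of Eq. (1) for this channel
A = 1 ./ (1 + exp(-M));                       % activation applied to obtain the plane's output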
Figure 12. LeNet architecture
Max-pooling map:
In the CNN architecture, the sub-sampling layers are used not only reducing the size of
the convolution maps but also introducing the rotational and translational invariance.
The maximum activation value within a block is declared as the output of the feature map. The operation of the sub-sampling layer can be summarized as follows: the number of planes in the sub-sampling layer follows that of the convolution layer, and the sub-sampling layer divides its input into non-overlapping blocks of 2 x 2 pixels. The sum over each block is calculated, multiplied by an adjustable weight, and a bias value is added; the result is then passed through an activation function to produce the output. Hence the sub-sampling layer reduces the spatial dimensions of its input.
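A minimal MATLAB sketch of this 2 x 2 sub-sampling, assuming A is the activation map from the previous layer and that w and b2 are the adjustable weight and bias (all placeholders here), is:
A  = rand(128, 128);                    % placeholder activation map from the previous layer
w  = 0.25;  b2 = 0;                     % placeholder adjustable weight and bias
S  = conv2(A, ones(2, 2), 'valid');     % sum over every 2 x 2 neighbourhood
S  = S(1:2:end, 1:2:end);               % keep only non-overlapping blocks
P  = 1 ./ (1 + exp(-(w * S + b2)));     % weighted, biased and passed through the activation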
The last convolution layer produces a scalar output. In a CNN, the output layer is designed using either the sigmoidal function or the radial basis function; the architecture of the radial basis function network is shown in Figure 13. The radial basis network is based on approximation theory.
In MATLAB, the newrb(P,T) function is used to create a radial basis network, where P is the input and T is the target output vector. The hidden nodes implement the radial basis function. In real-time applications, the output indicates the category of the given image, and it gives fast and accurate results.
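A minimal sketch of this step, assuming P holds the feature vectors as columns and T the corresponding class codes (both placeholders below), could be:
P   = rand(20, 50);                     % placeholder: 20 features x 50 training images
T   = double(rand(1, 50) > 0.5);        % placeholder class codes (0 = healthy, 1 = diseased)
net = newrb(P, T);                      % create the radial basis network
Y   = sim(net, P);                      % network response for the given inputs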
Classification model
The softmax function maps a K-dimensional vector z of arbitrary real values to a K-dimensional vector y of real values between 0 and 1. This function is also called the normalized exponential function and its mathematical representation is as follows:
y_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}, for j = 1, …, K.    (3)
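For a single score vector z, Eq. (3) can be evaluated directly in MATLAB; the values below are illustrative only:
z = [2.0; 1.0; 0.1];                            % placeholder class scores
y = exp(z - max(z)) ./ sum(exp(z - max(z)));    % softmax output; subtracting max(z) improves numerical stability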
The confusion matrix is used to measure the performance of the classification model. Accuracy, precision and recall are calculated using the following parameters.
Accuracy: the ratio of correctly predicted observations to the total number of observations.
The proposed method is validated on a set of real data acquired with a mobile camera and a digital camera. The dataset contains nearly a thousand images of healthy and disease-affected stevia leaves, taken with different sizes, poses, orientations and backgrounds. To compare our proposed algorithm we tested different splits of training and testing data, with the training set varying over 80%, 60% and 40% of the data.
Table-1. Accuracy, Precision and Recall of proposed method using CNN
The stochastic gradient descent algorithm is used in the proposed work to learn the set of weights and biases of the neural network that minimizes the loss function. Using accuracy, precision and recall, we evaluate the efficiency of the proposed work on color and gray-scale images, and the values are tabulated in Table-1.
The results were obtained using the deep learning model for identifying the images of disease-affected stevia. The proposed work was able to classify the diseases of stevia leaves well. In general, green color indicates healthy stevia leaves, while leaves with black, brown and yellow coloration may be considered disease-affected leaves.
6. Future Work
The proposed work produced good results for disease detection in the stevia plant and can be used as a decision support tool for identifying disease-affected leaves easily. In future, this work can be extended to analyze and estimate the severity of the diseases automatically, which will help farmers make decisions easily, effectively and at the right time.
7. Conclusion
Plant diseases severely reduce production and the quality of the end product. Automated tools can act as an aid to recognize the problem at the earliest stage so that production and quality can be maintained. In this paper, we presented a method
based on convolutional neural network to identify and classify the diseases in stevia. It
can act as a decision support tool for identifying diseases in stevia. Our main
contribution is to apply deep neural networks to detect different types of diseases in
stevia field.
References
1) Al-Hiary, H.; Bani-Ahmad, S.; Reyalat, M.; Braik, M.; ALRahamneh, Z.: Fast and accurate
detection and classification of plant diseases. International Journal of Computer Applications,
17(1):0975–8887, 2011
2) Cui, Di; Zhang, Qin; Li, Minzan; Hartman, Glen L.; Zhao, Youfu: Image processing methods for
quantitatively detecting soybean rust from multispectral images. Biosystems Engineering,
107(3):186–193, 2010.
3) Dheeb Al Bashish, Malik Braik and Sulieman Bani-Ahmad (2010), A Framework for Detection and Classification of Plant Leaf and Stem Diseases, International Conference on Signal and Image Processing, pp. 113-118.
4) Elaksher A.F. “Multi-Image Matching Using Neural networks and Photogrammetric Conditions”,
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information
Sciences. Vol. XXXVII. Part I,, B3a. Beijing 2008.
5) Goyal, S.K., Samsher and Goyal, R.K. (2010). Stevia (Stevia rebaudiana) a bio-sweetener: A
review. International Journal of Food Sciences and Nutrition, 61, 1, 1-10.
6) Junna Cheng, Guangrong Ji, Chen Feng, Haiyong Zheng Application of Connected
Morphological Operators to Image Smoothing and Edge Detection of Algae. International
Conference on Information Technology and Computer Science. 2009, pp. 73-76.
7) Kobylewski, S. and Eckhert, C.D. (2008). Toxicology of rabaudioside A: A review. Retrieved July
20, 2011.
8) Mohammed T.S, Al-Taie, “Artificial Neural Networks as Decision-Makers for Stereo Matching”,
GSTF International Journal on Computing, Vol.1, No. 3, August- 2012.
9) Mohan, K. Jagan; Balasubramanian, M.; Palanivel, S.: Detection and Recognition of Diseases from Paddy Plant Leaf Images. International Journal of Computer Applications, 144(12):34–41, 2016.
10) Muhammad Ghawas, M., Wan Zaki W.M, Zawayi M, Mansor M, Muhammad Azhar A,B(2009),
Penanaman dan pengeluaran tanaman stevia , Bull Teknol,Tanaman Bil.6
11) Patil, Sagar; Chandavale, Anjali: A Survey on Methods of Plant Disease Detection. International
Journal of Science and Research (IJSR), 6(14), 2013.
12) R.A.AI-Yahjao and D.Lupton, Leavy Medicinal herbs:Botany, Chemistry, Postharvest
Technology and Uses
13) Ramanathan Parimalavalli and S.Radhai Sri Periyar university , salem, PSG college of Arts and
Science, Coimbatore
14) S. L. Phung and A. Bouzerdoum, “A pyramidal neural network for visual pattern recognition,”
IEEE Transactions on Neural Networks, vol. 27, no. 1, pp. 329–343, 2007.
15) S. L. Phung and A. Bouzerdoum, ”MATLAB library for convolutional neural network,” Technical
Report, ICT Research Institute, Visual and Audio Signal Processing Laboratory, University of
Wollongong. Available at: http://www.uow.edu.au/˜phung.
16) Sabine D. Bauer, Filip Korc, Wolfgang Forstner, The Potential of Automatic Methods of
Classification to identify Leaf diseases from Multispectral images, Published online: 26 January
2011, Springer Science+Business Media, LLC 2011., Precision Agric (2011 12:361–377, DOI
10.1007/s11119-011-921
17) Sandesh B. K., Shalini C., B.R. Brinda, M.A. Kumar. Digital image processing—an alternate tool
for monitoring of pigment levels in cultured cells with special reference to green alga
Haematococcus pluvialis, Biosensors and Bioelectronics. 2005, 21: 768–773.25
18) Soejarto DD, Douglas K, Farnsworth NR. 1982. Potential sweetening agents of plant origin-III.
Organoleptic evaluation of Stevia leaf herbarium samples for sweetness. J Nat Prod 45(5):590-599
19) Tan S. L, Wan Zaki W.M., Muhamad Ghawas, M. Mansor, Zawawi M (2010), "Stevia (Stevia rebaudiana Bertoni)", In Teknologi Penananaman dan Pemprosesan Primer Tumbuhan Ubatan (Musa Y, Mansor P, Yahaya H, Wan Zaki, editors).
20) Wang Jue, Wang Shitong, Image Thresholding Using Parzen Window Estimation. Journal of Applied Sciences, 8(5):772-779, 2008, ISSN 1812-5654, Asian Network for Scientific Information, 2008.
21) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998
Deep Learning for Image Processing Applications 111
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-111
Abstract. In this work, an attempt has been made to demarcate Tuberculosis (TB) sputum smear positive and negative images using a statistical method based on the Gray Level Co-occurrence Matrix (GLCM). The sputum smear images (N=100), recorded under a standard image acquisition protocol, are considered for this work.
Haralick descriptor based statistical features are calculated from the sputum smear
images. The most relevant features are ranked by principal component analysis. It
is observed that the first five principal components contribute more than 96% of
the variance for the chosen significant features. These features are further utilized
to demarcate the positive from negative smear images using Support Vector
Machines (SVM) and Differential Evolution based Extreme Learning Machines
(DE-ELM). Results demonstrate that DE-ELM performs better than SVM in terms
of performance estimators such as sensitivity, specificity and accuracy. It is also
observed that the generalization learning capacity of DE-ELM is better in terms of
number of hidden neurons utilized than the number of support vectors used by
SVM. Thus it appears that this method could be useful for mass discrimination of
positive and negative TB sputum smear images.
1. Introduction
Tuberculosis (TB) is the only disease ever declared a global emergency by the World
Health Organization (WHO). TB is an infectious disease caused by the bacillus
Mycobacterium tuberculosis. It is an airborne infectious disease that spreads easily in
densely populated areas with poor sanitation. TB spreads when people infected with
pulmonary TB expel bacteria into air by coughing. It typically affects the lungs
(pulmonary TB) but can also affect other organs or tissues such as the brain, kidneys,
bone and skin (extra-pulmonary TB) [1, 2].
TB has existed for millennia and remains a major global health problem. According to the WHO, six countries account for 60% of the global TB burden: India, Indonesia, China, Nigeria, Pakistan and South Africa. This is despite the fact that, with a timely diagnosis and correct treatment, most people who develop TB disease can be cured [1].
1.2. TB diagnosis
The diagnostic techniques for TB include both invasive and non-invasive methods.
Invasive techniques, such as QuantiFERON and T-SPOT blood tests, promise to be
more precise and reliable than other commonly used non-invasive methods. However,
according to WHO, the invasive methods are too expensive for poor resource countries
[3]. The non-invasive diagnostic techniques include smear microscopy, chest
radiography and culture test. Repeated examinations of the infected person are needed
for early detection of the disease. A conclusive result from the culture test takes as much as two months because the disease-causing agent, the tubercle bacillus, takes 15 to 20 hours to
duplicate. Chest radiography tests work adequately only when there is a high level of
infection and so this procedure cannot be used for TB diagnosis in early stages [4-6].
The WHO developed Directly Observed Treatment, Short course (DOTS) strategy
for TB control which has been adopted by many national tuberculosis programmes.
The DOTS strategy recommends smear microscopy as the most effective tool for the
diagnosis of TB and for monitoring patient response to treatment [7, 8]. The low and
middle-income countries mainly rely on the smear microscopy for TB screening as it is
the backbone of TB screening and cost effective too. Sputum smear microscopy is
capable of detecting majority of infectious TB cases thereby reducing TB prevalence
and incidence [9].
The two widely used microscopic diagnostic techniques for TB screening are fluorescence and conventional microscopy. Conventional microscopy uses a conventional artificial light source. The sample smears are stained with carbolfuchsin
solution according to Ziehl-Neelsen (ZN) or Kinyoun acid-fast stains which cause the
TB bacilli to appear magenta against a light blue background. The bacilli may also take
different colours varying from light fuchsia to dark purple. These bacilli are called
acid-fast bacilli as they retain the dye even after washing with acid and alcohol [10, 11].
Conventional microscopy is inexpensive, rapid, and highly specific but has poor
sensitivity, particularly in patients co-infected with human immune deficiency virus. In
addition, examination of ZN-stained smears takes more time than fluorochrome-stained
smears. Also the bacilli are not clearly separated from the background in the image
obtained by a conventional light microscope [5, 8, 12].
Fluorescence Microscopy (FM) renders improved sensitivity in the diagnosis of
pulmonary TB. The technique uses an acid-fast fluorochrome dye such as auramine O
or auramine-rhodamine and high intense light source such as a halogen or high pressure
mercury vapor lamp. Depending on the staining procedure, the bacilli appear in different colours, such as reddish yellow, reddish golden yellow or bright orange yellow fluorescence, against a dark background. When stained with auramine O and excited by blue light (450–480 nm), the bacilli emit in the green-reddish yellow range (500–600 nm) [11, 13].
During the past decade, outcomes from many published articles have revealed the
capability of texture analysis algorithms to infer diagnostically significant information
from medical images [31]. Texture is a characteristic that is present in almost all
images, and is considered to be one of the most important properties used in the
identification or classification of image objects or regions [32]. Since the textural
properties of images appear to carry useful information for discrimination purposes, it
is important to develop features from textural information [33]. Texture or structural features represent the spatial distribution of gray values. Texture features are a rich
source of visual information and are key components in image analysis and
understanding [34]. Texture features are, in fact, mathematical parameters computed
from the distribution of pixels, which characterize the texture type and thus the
underlying structure of the objects present in the image [35].
A large body of literature exists for texture analysis of ultrasound, Magnetic
Resonance (MR) imaging, Computed Tomography (CT), fluorescence microscopy,
light microscopy and other digital images. Optical imaging modalities such as
fluorescence microscopy images of colonic tissue sections and light microscopy images
of the chromatin structure in advanced prostate cancer have utilized texture analysis [36,
37].
Haralick proposed the Gray Level Co-occurrence Matrix (GLCM), which has become one of the most well-known and widely used texture descriptors in medical image analysis. The GLCM, also called the gray-tone spatial-dependency matrix, is a tabulation of how often combinations of pixel brightness values occur in an image [33].
Texture features derived from the co-occurrence matrix have been successfully
employed to develop criteria for the categorization of normal and cancerous colonic
mucosa, regardless of grade. Haralick’s texture features have also been implemented to
obtain methods for numerical description, objective representation and successive
categorization of cellular protein localization patterns in FM images [36].
Applications of texture features include, but are not limited to, the identification of masses in digital mammograms, content-based image retrieval for human brain indexing, texture-based segmentation, and prostate cancer characterization on MR images [38-41]. The literature suggests using image texture in combination with
color features to diagnose leukemic malignancy in samples of stained blood cells. In
combination with color, the texture features significantly improved the correct
classification rate of blood cell types compared to using only color features. Various
first-order statistics (such as mean gray level in a region) as well as second-order
statistics (such as gray level co-occurrence matrices) are derived to differentiate
different types of white blood cells [42].
Studies comparing the performance of GLCM features with other texture analysis
techniques show that GLCM is one of the most powerful methods for general texture
classification. GLCM features perform better than fractal, Markov random field and Gabor filter features in classifying a wide range of texture images, including synthetic
and natural images [32].
If measures are not taken to reduce the number of features before classification, the classifier may fit the noise or random error of the underlying data [43]. This will most likely result in over-training, as it gives too many degrees of freedom to the
classifier. To get good generalization properties of the classifier, it is desirable to keep
the number of features as low as possible. To perform the selection of features in an
automatic fashion, a method to judge the quality of the resultant classifier is needed
[14].
Classifiers are widely used in discriminating pathological condition from the normal.
The classification process involves grouping of data into pre-defined classes or finding
the class to which a data belongs. This process plays an important role in medical
image automation, which is a part of decision making in medical image analysis.
Machine learning based classification techniques provide support for many areas of
health care, including prognosis, diagnosis and screening [46].
The Support Vector Machine (SVM) is a machine learning technique, based on statistical learning theory, for characterizing data and recognizing patterns. It is a supervised learning method. The training principle behind SVM is that it seeks an optimal separating hyperplane so that the expected classification error for unseen test samples is minimized. The foremost and notable feature of this approach is that the solution is based only on the data points at the extremity, called support vectors [47].
Some of the benefits of SVM include the handling of continuous and binary classes with reasonable classification speed and accuracy, but SVM takes longer to train and does not handle discrete attributes well. It has found application in the classification of brain MR images using a genetic algorithm with SVM, and is able to
classify brain tissue into normal, benign or malignant tumor [48]. SVMs have been
used for lung nodule detection from chest CT images, to classify digital mammography
images and in combination with multiscale local binary patterns to classify saliency-
based folded images [49-51].
The Differential Evolutionary Extreme Learning Machine (DE-ELM) takes advantage of both ELM and Differential Evolution (DE): it removes redundancy among hidden nodes and achieves satisfactory performance with compact, high-speed neural
input weights and hidden layer biases whereas the output weights are determined by the
generalized inverse procedure. DE-ELM has been prolifically used in the field of
medical diagnosis because of its compact architecture and speed [52-54].
In this work, GLCM matrix based Haralick descriptors are extracted from the
digital TB sputum smear images. The most significant GLCM features are selected by
PCA and are further classified by supervised learning machines such as SVM and DE-
ELM for demarcating the TB positive from the negative images.
The chapter is organized as follows. Section 2 gives a brief description of the methods adopted in this chapter. Section 2.1 presents the details of the image acquisition protocol. The remaining topics in section 2 describe the GLCM based texture analysis, the PCA based feature selection technique, and the machine learning methods, namely the SVM and hybrid DE-ELM classifiers. Section 3 discusses the results of the above-mentioned methods and section 4 presents the significant conclusions.
2. Methodology
The sputum smear slides used were prepared by smearing the sputum specimen on a
clean slide. The air dried smears were then fixed to the slide by passing the slides
through a low flame two to three times. The slides were flooded with Auramine O stain
to bind the cell wall of the acid fast bacilli and left for 10 minutes. Then the slides were
decolorized with acid alcohol and washed with running water. To get a contrast
background the slides were counter stained with potassium permanganate. Then the
slides were rinsed with water and air dried before viewing under the microscope. The
fluorescence-stained slides were prepared at the South African National Health
Laboratory Services, Groote Schuur Hospital in Cape Town.
The images (N=100) were captured using a camera in monochrome binning mode attached to a fluorescence microscope with a 20x objective of 0.5 numerical aperture. The camera (AxioCam HR) has a resolution of 4164 x 3120 pixels with a pixel size of 6.45 μm (h) x 6.45 μm (v). Images of size 256 x 256, comprising 50 positive and 50 negative sputum smear images, are considered for the analysis. The acquired images suffer from non-uniform illumination and are therefore pre-processed with a non-uniform illumination correction technique [55]. The pipeline stages presented in Figure 1 give an overview of the work carried out. In the first stage, statistical texture analysis is
performed for the pre-processed images based on Gray-Level Co-occurrence Matrix
(GLCM) method.
Literature reveals that there are four groups of texture analysis methods, such as
statistical, geometrical, model-based and signal processing. Each and every method has
its own characteristics based on their capabilities and application in which it has been
implemented. Among them, the statistical method based on GLCM is reported to be the
most commonly and predominantly used [56]. It has been shown that the second-order
statistical-based co-occurrence matrix produces better classification accuracy over the
other texture analysis methods [57, 58].
Geometrical method-based structural analysis is well suited for the synthesis of an
image than its analysis, whereas estimation of model-based parameters is
computationally complex. The statistical method of defining texture describes the
spatial relationship of pixels in the GLCM. This approach represents the texture of an
image by calculating how often pixel pairs with specific values and in a specified
spatial relationship occur in an image. These methods normally achieve higher
discrimination indexes than the geometrical method based structural or transform
methods. In transform methods, the texture properties of the image are analyzed in the
frequency or scale space [36].
The GLCM matrices are constructed by computing pixel intensity value co-occurrences at a distance of d = 1, representing one pixel distance, and for directions θ given as 0°, 45°, 90° and 135°. Since a single direction might not give enough reliable texture information, four directions are used to extract the texture information. The four directional GLCMs are averaged to furnish a non-directional matrix from which the textural features are evaluated. The GLCM features obtained from this matrix are rotation invariant, being averaged across the 0°, 45°, 90° and 135° directions [36].
If an image contains no texture, the GLCM values are concentrated on its diagonal; the off-diagonal values of the GLCM become larger as the image texture increases. The number of rows and columns in the GLCM is determined by the number of gray-scale intensity values in the image [59].
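In MATLAB, this four-direction construction could be sketched with graycomatrix as below; the file name is a placeholder for one pre-processed sputum smear image, and the 'Symmetric' option is an assumption rather than a setting stated in this work:
img     = imread('smear01.png');                                  % placeholder file name (assumed gray-scale image)
offsets = [0 1; -1 1; -1 0; -1 -1];                               % d = 1 at 0, 45, 90 and 135 degrees
glcms   = graycomatrix(img, 'Offset', offsets, 'Symmetric', true);
glcm    = mean(glcms, 3);                                         % average into a non-directional matrix
P       = glcm ./ sum(glcm(:));                                   % normalise to the probability matrix P(i,j)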
Features and their formulas:

Cluster shade: \sum_{i=0}^{N_g-1} \sum_{j=0}^{N_g-1} \{i + j - \mu_x - \mu_y\}^3 P(i, j)

Dissimilarity: \sum_{i=0}^{N_g-1} \sum_{j=0}^{N_g-1} P(i, j) |i - j|

Entropy: -\sum_{i=0}^{N_g-1} \sum_{j=0}^{N_g-1} P(i, j) \log\{P(i, j)\}

Difference entropy: -\sum_{i=0}^{N_g-1} P_{x-y}(i) \log\{P_{x-y}(i)\}

where

\mu_x = \sum_{i=0}^{N_g-1} i \sum_{j=0}^{N_g-1} P(i, j)    (1)

\mu_y = \sum_{i=0}^{N_g-1} \sum_{j=0}^{N_g-1} j P(i, j)    (2)

HXY1 = -\sum_{i=0}^{N_g-1} \sum_{j=0}^{N_g-1} P(i, j) \log\{P_x(i) P_y(j)\}    (4)

HXY2 = -\sum_{i=0}^{N_g-1} \sum_{j=0}^{N_g-1} P_x(i) P_y(j) \log\{P_x(i) P_y(j)\}    (5)
P(i, j) is the (i, j)th entry in the normalized gray level co-occurrence matrix, representing the probability density, and N_g is the number of gray levels in the image. The probability distribution of the co-occurrence matrix coordinates with gray-level difference x - y is referred to as P_{x-y}(k), where

P_{x-y}(k) = \sum_{i=0}^{N_g-1} \sum_{j=0}^{N_g-1} P(i, j), |i - j| = k, k = 0, 1, ..., N_g - 1    (6)
X = SP^T + E    (7)
where S is the score matrix comprising the principal components, P the loading matrix comprising the eigenvectors of the covariance matrix, and E the residual matrix representing the unexplained variance. PCA essentially rotates the data about their mean in order to
line up with the principal components. This procedure combines as much of the
variance as possible into the first few dimensions using an orthogonal transformation.
The values in the remaining dimensions, therefore, tend to be negligible and may be
removed with minimal loss of information [74, 75]. The most significant features
chosen by PCA in the second stage are used as input feature vectors to the machine
learning algorithms such as SVM and DE-ELM classifiers in the third stage.
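A brief MATLAB sketch of this selection step, assuming X is an images-by-features matrix of GLCM descriptors (a placeholder below), could be:
X = rand(50, 12);                                  % placeholder feature matrix: 50 images x 12 GLCM features
[coeff, score, latent, ~, explained] = pca(X);     % loadings, scores, eigenvalues and explained variance
varFirstFive = sum(explained(1:5));                % variance captured by the first five principal components
[~, ranking] = sort(abs(coeff(:, 1)), 'descend');  % features ranked by loading magnitude on PC1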
The support vector machine is a linear machine with a single output, formed by the non-linear mapping of the N-dimensional input vector x into a K-dimensional feature space (K > N) through a non-linear function. SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for
classification, regression, or other tasks. Intuitively, a good separation is achieved by
the hyperplane that has the largest distance to the nearest training data points of any
class (so-called functional margin), since in general the larger the margin the lower the
generalization error of the classifier.
D = \{(x_1, y_1), ..., (x_n, y_n)\}, x \in \Re^n, y \in \{1, -1\}    (8)

\langle w, x \rangle + b = 0    (9)

\min_i |\langle w, x_i \rangle + b| = 1    (10)
Different kernel functions such as linear, quadratic, Radial Basis Function (RBF)
and polynomial are used to process the data in higher dimensional space. Polynomial
mapping is a popular method for non-linear modeling, with a kernel function
K(x, x') = \langle x, x' \rangle^d    (11)
where d is the order of the polynomial kernel. In this analysis, d is varied from three to ten to identify the optimal order for classification based on its performance measures. The optimal order of the polynomial kernel and the optimal width of the RBF kernel implemented in the SVM classifier are chosen to secure the maximum values of the performance estimators.
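As an illustrative sketch only, a polynomial-kernel SVM of order three could be trained in MATLAB with fitcsvm as below; the training and test matrices are placeholders:
Xtrain = rand(60, 5);  Ytrain = [ones(30, 1); -ones(30, 1)];      % placeholder training data (5 GLCM features)
Xtest  = rand(40, 5);                                             % placeholder test data
mdl    = fitcsvm(Xtrain, Ytrain, 'KernelFunction', 'polynomial', 'PolynomialOrder', 3);
Ypred  = predict(mdl, Xtest);                                     % predicted class labels
nSV    = size(mdl.SupportVectors, 1);                             % number of support vectors used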
As noted earlier, the hybrid DE-ELM classifier takes advantage of both ELM and DE, removing redundancy among hidden nodes and achieving satisfactory performance with a compact, high-speed neural network.
Figure 2. Flow chart for training of ELM based on the differential evolutionary process (input samples are followed by parameter initialization and the DE steps of mutation and crossover and selection, repeated until the error is minimal).
The SLFN is treated as a linear system once the hidden node parameters are randomly initialized. The activation functions for the hidden neurons attempted in this work include the sigmoid, sine, hard limit, triangular and radial basis functions. These functions are considered because they are infinitely differentiable. In this work, the input data are normalized between 0 and 1 while the weights and biases are initialized between -1 and +1. The outputs (targets) are normalized into [-1, 1]. The network has better generalization performance because of the smaller training error, since the norm of the weights is small. The training process of ELM based on the differential evolutionary
process is presented in Figure 2.
In the DE process, the population vector consisting of the input weights and hidden layer biases is defined as

\theta = \{w_1^T, w_2^T, ..., w_N^T, b_1, b_2, ..., b_N\}    (12)
by a constant CR in the crossover process. The three steps of the DE process are
repeated for each of the generated population until the goal is met or the number of
maximum learning epochs is reached. The output weights are determined for each of
the individuals from the hidden layer output matrix H by Moore Penrose generalized
inverse. The structure of ELM resembles the architecture of a single hidden layer
feedforward neural network. The main goal of the training process is to determine the
network (input and output) weights
\beta_i = [\beta_{i1}, \beta_{i2}, ..., \beta_{im}]^T    (14)
and b_i the hidden layer biases. This is to minimize the error function represented by
\sum_j \|o_j - t_j\|    (15)

where

o_j = [o_{j1}, o_{j2}, ..., o_{jn}]^T    (16)

t_j = [t_{j1}, t_{j2}, ..., t_{jn}]^T    (17)

\sum_i \beta_i g(w_i \cdot x_j + b_i) = o_j, j = 1, 2, ..., N    (18)
where g(·) is the activation function. More compactly, the equations can be written as Hβ = T, from which the output weights are estimated as β = H†T, where H† is the Moore-Penrose pseudo-inverse of H [79].
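Leaving out the DE loop that tunes w_i and b_i, the ELM output-weight computation can be sketched in MATLAB as below; the data, the number of hidden neurons and the sigmoidal activation follow the settings described above, while the actual values are placeholders:
X    = rand(100, 5);                                   % placeholder inputs: samples x features, scaled to [0, 1]
T    = sign(rand(100, 1) - 0.5);                       % placeholder targets normalized into [-1, 1]
Nh   = 10;                                             % number of hidden neurons
W    = 2 * rand(5, Nh) - 1;                            % input weights initialized in [-1, 1]
b    = 2 * rand(1, Nh) - 1;                            % hidden layer biases in [-1, 1]
H    = 1 ./ (1 + exp(-(X * W + repmat(b, 100, 1))));   % sigmoidal hidden layer output matrix
beta = pinv(H) * T;                                    % output weights, beta = H†T, via the Moore-Penrose inverse
O    = H * beta;                                       % network outputs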
The performance of the classifier is tested using estimators such as sensitivity, specificity and accuracy. Sensitivity and specificity provide information about false negatives and false positives respectively. Accuracy integrates both of the above indices and indicates the number of images correctly classified. These measures are computed
from the values of True Positive (TP), True Negative (TN), False Positive (FP) and
False Negative (FN).
Sensitivity = TP / (TP + FN)    (19)

Specificity = TN / (TN + FP)    (20)

Accuracy = (TP + TN) / (TP + FN + TN + FP)    (21)
Positive Predictive Value (PPV) describes the proportion of positive tests that are
TP and represent the presence of disease. Negative Predictive Value (NPV) describes
the proportion of negative tests that are TN and represent the absence of disease.
PPV = TP / (TP + FP)    (22)

NPV = TN / (TN + FN)    (23)
F - measure is defined as the harmonic mean of precision (P) and recall (R)
and is given by
F = 2 × (P × R) / (P + R)    (24)
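These estimators follow directly from the confusion-matrix counts; the MATLAB sketch below uses illustrative placeholder counts only:
TP = 20; TN = 19; FP = 1; FN = 0;                              % placeholder counts for illustration
sensitivity = TP / (TP + FN);                                  % Eq. (19)
specificity = TN / (TN + FP);                                  % Eq. (20)
accuracy    = (TP + TN) / (TP + FN + TN + FP);                 % Eq. (21)
ppv         = TP / (TP + FP);                                  % Eq. (22)
npv         = TN / (TN + FN);                                  % Eq. (23)
f_measure   = 2 * (ppv * sensitivity) / (ppv + sensitivity);   % Eq. (24), with P = PPV and R = sensitivity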
Typical representative TB positive and negative smear images are shown in Figure 3 (a) and (b) respectively. In general, the objects present in sputum smear images are categorized as bacilli and non-bacilli. The TB positive image shows the presence of rod-
shaped bright foreground objects (bacilli) on a dark background. The negative images
show scanty or absence of bacilli. The non-bacilli objects are categorized into outliers
and overlapping bacilli. The outliers present can be due to poor or non-specific staining
of the smear slides or due to overlapping bacilli. The overlapping bacilli exist when
there is a conglomeration of bacilli in the image. The outliers present in the sputum
smear images do not have uniform morphology and the only objects that have uniform
morphology are the disease causative agent called TB bacilli.
Thus in a sputum smear image distinguishing bacilli from outliers cannot be
effectively done by manual intervention. Hence automated procedures aid in
the demarcation of TB positive from negative smear images. The GLCM based analysis attempted in this work serves as an automated procedure that captures the repetitive patterns in the image, as the TB-causing bacilli have a uniform rod shape.
Figure 3. Typical TB (a) positive and (b) negative sputum smear image
Among the GLCM based Haralick features, five most significant features are
selected using PCA. The selection of the features was based on the highest magnitude
of eigenvalues in the first principal component.
Table 2. Contribution of the principal components for TB positive and negative images

Principal components    % variance (TB positive)    % variance (TB negative)
PC1                     87.31                       75.27
PC2                     10.00                       19.41
PC3                      2.09                        0.78
PC4                      0.13                        0.53
PC5                      0.02                        0.11
Table 3. Magnitudes of the features contributing to PC1 for TB positive and negative images

Features contributing to PC1               Magnitude (TB positive)    Magnitude (TB negative)
Cluster shade                              0.15                       0.47
Dissimilarity                              0.46                       0.21
Entropy                                    0.42                       0.42
Difference entropy                         0.45                       0.26
Information measure of correlation II      0.22                       0.39
Table 3 lists the component magnitudes of the eigenvector corresponding to PC1, which has the largest eigenvalue. The magnitudes of the five components corresponding to PC1 are examined. For TB positive images, dissimilarity has the largest magnitude, followed by difference entropy, entropy, information measure of correlation II and cluster shade; for TB negative images, cluster shade has the largest magnitude, followed by entropy, information measure of correlation II, difference entropy and dissimilarity.
A higher magnitude denotes greater similarity in direction between PC1 and the corresponding basis vector of the original feature space, so the feature with the largest magnitude is the most sensitive one, followed by the others. As a result, the presented selection scheme was able to rank the five features. Thus the PCA approach is applied to select the most representative features extracted from the original feature set to improve the effectiveness of classification.
Figure 4. Variation in component magnitudes of selected GLCM features with the first five principal
components for TB (a) positive and (b) negative images
Figure 5. Variation in normalized (a) cluster shade (b) dissimilarity (c) entropy (d) difference
entropy (e) information measure of correlation II for different images
It is observed from the scattergrams that the TB positive and negative images are indistinguishable for some feature pairs, as the data points overlap. A similar pattern of variation between dissimilarity and difference entropy is observed for TB negative images but not for TB positive images, and a distinct variation is observed for the dissimilarity and difference entropy features. The presence of tiny rod-shaped bacilli causes the magnitude of the dissimilarity feature to be higher in positive than in negative images, while the organization of gray levels causes the difference entropy to be higher in TB positive than in negative images.
The five most significant GLCM features are given as input to the SVM and DE-ELM classifiers. Among the acquired TB images, 60% are chosen for training and the remaining 40% for testing. Table 4 shows the results of SVM classification for different parameter selections. Though the specificity measure is the same for all kernel types, the polynomial kernel of order three shows better performance in terms of sensitivity and accuracy. Accuracy is observed to be 92.5% for the polynomial kernel, with a smaller number of support vectors.
Table 4. Performance analysis of SVM classifier
Other useful and more informative measures are the PPV, NPV and F-measure.
These measures also demonstrate that the SVM classifier performs better for
polynomial kernel.
Figure 7. Error plot for varying number of hidden neurons and activation function
The five most significant GLCM features are given as input to the hybrid DE-ELM
classifier to achieve better performance than SVM. The performance of DE-ELM
classifier is evaluated using RMSE by varying the number of hidden neurons as shown
in Figure 7. The evaluation is performed for different activation functions, namely sigmoidal, sinusoidal, hard limit, triangular and radial basis. Results demonstrate that the generalization performance of the hybrid classifier is stable over a wide range (number) of hidden neurons. It is observed that the sigmoidal activation function performs with a lower RMSE value (0.409) for ten hidden neurons. Hence the
performance of the DE-ELM classifier is further analyzed for this activation function. The slope of the sigmoidal activation function is chosen as 0.2, which is approximately equal to the inverse of the number of most significant features, i.e. the number of input neurons.
Table 5. Confusion matrix

                   Identified class
Actual class       Positive    Negative
Positive           20          0
Negative           1           19
Table 5 shows the confusion matrix presenting the actual and predicted classes of the classifier. Since all the positive images presented to the classifier are identified correctly, the sensitivity is observed to be 100%. All but one of the 40% test images are correctly identified, and thus the accuracy is observed to be 97.5%.
Performance measures (%)

NHN#                     10
Sensitivity (Recall)     100
Specificity              95
Accuracy                 97.5
PPV (Precision)          95.24
NPV                      100
F-measure                97.56

# NHN – Number of Hidden Neurons
4. Conclusion
Early diagnosis and effective treatment are the essential long-term strategies for controlling the TB epidemic. The first component of the WHO post-2015 strategy is early diagnosis of TB [80]. Although many diagnostic methods have been developed, acid-fast bacilli smear microscopy is still considered the gold standard for identifying highly contagious patients [81]. Automated analysis is therefore essential for mass screening of this epidemic.
In this work, an attempt has been made to automate the analysis of digital TB
images using statistical method based on GLCM approach. Haralick descriptor based
features are extracted from the sputum smear images. The most significant of them are selected by PCA. These features are then classified using the SVM and DE-ELM classifiers, and their performance is analyzed.
Results demonstrate that the statistical features based on GLCM are able to differentiate TB negative from positive images. This is possible because the GLCM features characterize the pixel intensity distribution in the image. It was observed that the SVM classifier with a polynomial kernel presented relatively high sensitivity, specificity and accuracy.
Compared to SVM, the hybrid DE-ELM classifier reported higher sensitivity, specificity and accuracy in classifying the images into positive and negative. The performance measures of the DE-ELM classifier are better because of its evolutionary learning algorithm, which identifies more true positives and true negatives than SVM. The increase in sensitivity is observed
because of the uniform morphology exhibited by the bacilli present in TB positive
images. As this automated analysis is important in diagnosing and monitoring of
pulmonary tuberculosis, this study seems to be relevant for better clinical interventions.
Acknowledgement
The authors acknowledge Dr. Tania S. Douglas and Mr. Sriram Krishnan of Medical
Imaging Research Unit, University of Cape Town, South Africa for the thought worthy
discussions during acquisition of sputum smear images.
References
[11] C.F. Costa Filho and M.G. Costa, Sputum smear microscopy for tuberculosis: Evaluation of autofocus
functions and automatic identification of tuberculosis mycobacterium, Understanding Tuberculosis -
Global Experiences and Innovative Approaches to the Diagnosis, Dr. Pere-Joan Cardona (Ed.),
INTECH Open Access Publisher, (2012).
[12] A. Cattamanchi, J.L. Davis, W. Worodria, S. den Boon, S. Yoo, J. Matovu, J. Kiidha, F. Nankya, R.
Kyeyune, P. Byanyima and A. Andama, M. Joloba, D.H. Osmond, P.C. Hopewell and L. Huang,
Sensitivity and specificity of fluorescence microscopy for diagnosing pulmonary tuberculosis in a high
HIV prevalence setting. The International Journal of Tuberculosis and Lung Disease, 13(9), (2009),
1130-1136.
[13] B.J. Marais, W. Brittle, K. Painczyk, A.C. Hesseling, N. Beyers, E. Wasserman, D. van Soolingen and
R.M. Warren, Use of light-emitting diode fluorescence microscopy to detect acid-fast bacilli in sputum.
Clinical Infectious Diseases, 47(2), (2008), 203-207.
[14] C. Wählby, J. Lindblad, M. Vondrus, E. Bengtsson and L. Björkesten, Algorithms for cytoplasm
segmentation of fluorescence labelled cells. Analytical Cellular Pathology, 24(2, 3), (2002), 101-111.
[15] T.M. Daniel, Toman's tuberculosis. Case detection, treatment and monitoring: questions and answers,
(2004), p.351.
[16] P. Sadaphal, J. Rao, G.W. Comstock and M.F. Beg, Image processing techniques for identifying
mycobacterium tuberculosis in Ziehl-Neelsen stains [Short Communication]. The International Journal
of Tuberculosis and Lung Disease, 12(5), (2008), 579-582.
[17] R.A.A. Raof, M.Y. Mashor, R.B. Ahmad and S.S.M. Noor, Image Segmentation of Ziehl-Neelsen
Sputum Slide Images for Tubercle Bacilli Detection, Image Segmentation, Dr. Pei-Gee Ho (Ed.),
INTECH Open Access Publisher, (2011).
[18] A. Van Deun, A.H. Salim, E. Cooreman, M.A. Hossain, A. Rema, N. Chambugonj, M.A. Hye, A.
Kawria and E. Declercq, Optimal tuberculosis case detection by direct sputum smear microscopy: how
much better is more? The International Journal of Tuberculosis and Lung Disease, 6(3), (2002), 222-
230.
[19] R. Khutlang, S. Krishnan, R. Dendere, A. Whitelaw, K. Veropoulos, G. Learmonth and T.S. Douglas,
Classification of Mycobacterium tuberculosis in images of ZN-stained sputum smears. IEEE
Transactions on Information Technology in Biomedicine, 14(4), (2010), 949-957.
[20] M.G. Forero, G. Cristóbal and M. Desco, Automatic identification of Mycobacterium tuberculosis by
Gaussian mixture models. Journal of microscopy, 223(2), (2006), 120-132.
[21] K. Veropoulos, C. Campbell, G. Learmonth, B. Knight and J. Simpson, The automated identification of
tubercle bacilli using image processing and neural computing techniques. In 8th International
Conference on Artificial Neural Networks, Perspectives in Neural Computing. Springer, London, L.
Niklasson, M. Bodén, T. Ziemke, (Eds.), (1998), 797-802.
[22] K. Veropoulos, G. Learmonth, C. Campbell, B. Knight and J. Simpson, Automated identification of
tubercle bacilli in sputum: a preliminary investigation. Analytical and quantitative cytology and
histology, 21, (1999), 277-282.
[23] M.G. Forero, F. Sroubek and G. Cristóbal, Identification of tuberculosis bacteria based on shape and
color. Real-time imaging, 10(4), (2004), 251-262.
[24] R. Santiago-Mozos, R. Fernández-Lorenzana, F. Perez-Cruz and A. Artes-Rodriguez, On the uncertainty
in sequential hypothesis testing. In 5th IEEE International Symposium on Biomedical Imaging: From
Nano to Macro, Paris, France, (2008), 1223-1226.
[25] B. Lenseigne, P. Brodin, H.K. Jeon, T. Christophe, and A. Genovesio, Support vector machines for
automatic detection of tuberculosis bacteria in confocal microscopy images. In 4th IEEE International
Symposium on Biomedical Imaging: From Nano to Macro, Arlington, VA, USA, (2007), 85-88.
[26] R.A.A. Raof, Z. Salleh, S.I. Sahidan, M.Y. Mashor, S.M. Noor, F.M. Idris and H. Hasan, Color
thresholding method for image segmentation algorithm of Ziehl-Neelsen sputum slide images. In 5th
IEEE International Conference on Electrical Engineering, Computing Science and Automatic Control,
Mexico City, Mexico, (2008), 212-217.
[27] Y. Zhai, Y. Liu, D. Zhou and S. Liu, Automatic identification of mycobacterium tuberculosis from ZN-
stained sputum smear: Algorithm and system design. In IEEE International Conference on Robotics and
Biomimetics,Tianjin, China, (2010), 41-46.
[28] R. Rulaningtyas, A.B. Suksmono, T.L. Mengko and P. Saptawati, Identification of mycobacterium
tuberculosis in sputum smear slide using automatic scanning microscope. In S. Viridi, K. Basar, F.
Iskandar, W. Srigutomo, B. E. Gunara (Eds.), AIP Conference Proceedings, 1656(1), (2015), p. 060011.
[29] C.F.F. Costa Filho, P.C. Levy, C.D.M. Xavier, L.B.M. Fujimoto and M.G.F. Costa, Automatic
identification of tuberculosis mycobacterium. Research on Biomedical Engineering, 31(1), (2015), 33-
43.
[30] R.O. Panicker, B. Soman, G. Saini and J. Rajan, A review of automatic methods based on image
processing techniques for tuberculosis detection from microscopic sputum smear images. Journal of
medical systems, 40(1), (2016), 17.
[31] R. Lopes and N. Betrouni, Fractal and multifractal analysis: a review. Medical image analysis, 13(4),
(2009), 634-649.
[32] R.F. Walker, Adaptive multi-scale texture analysis: with application to automated cytology, PhD Thesis,
School of Computer Science and Electrical Engineering, The University of Queensland, (1997).
[33] R.M. Haralick and K. Shanmugam, Textural features for image classification. IEEE Transactions on
systems, man, and cybernetics, 3(6), (1973), 610-621.
[34] A. Madabhushi and D.N. Metaxas, Advances in computerized image analysis methods on breast
ultrasound. In Medical Imaging Systems Technology, Cornelius T. Leondes (Eds.), Analysis and
Computational Methods, 1, (2005), 119-150.
[35] G. Castellano, L. Bonilha, L.M. Li, and F. Cendes, Texture analysis of medical images. Clinical
radiology, 59(12), (2004), 1061-1069.
[36] V. Atlamazoglou, D. Yova, N. Kavantzas, and S.Loukas, Texture analysis of fluorescence microscopic
images of colonic tissue sections. Medical and Biological Engineering and Computing, 39(2), (2001),
145-151.
[37] K.W. Gossage, T.S. Tkaczyk, J.J. Rodriguez, and J.K. Barton, Texture analysis of optical coherence
tomography images: feasibility for tissue classification. Journal of biomedical optics, 8(3), (2003), 570-
575.
[38] A.M. Khuzi, R. Besar, W.W. Zaki, and N.N. Ahmad, Identification of masses in digital mammogram
using gray level co-occurrence matrices. Biomedical imaging and intervention journal, 5(3), (2009).
[39] J.E.E. de Oliveira, A. de Albuquerque Araújo, and T.M. Deserno, Content-based image retrieval applied
to BI-RADS tissue classification in screening mammography. World journal of radiology, 3(1), (2011),
24.
[40] H.B. Kekre, and S. Gharge, Texture based segmentation using statistical properties for mammographic
images. Entropy, 1, (2010), p.2.
[41] R. Lopes, A. Ayache, N. Makni, P. Puech, A. Villers, S. Mordon, and Betrouni, Prostate cancer
characterization on MR images using fractal features. Medical physics, 38(1), (2011), 83-95.
[42] M. Tuceryan, and A.K. Jain, Texture analysis. Handbook of pattern recognition and computer vision,
World Scientific Publishing Co., Inc. River Edge, NJ, USA, 2, (1993), 235-276.
[43] A. Kassner and R.E. Thornhill, Texture analysis: a review of neurologic MR imaging applications.
American Journal of Neuroradiology, 31(5), (2010), 809-816.
[44] C.C. Reyes-Aldasoro and A. Bhalerao, Volumetric texture description and discriminant feature selection
for MRI. In Biennial International Conference on Information Processing in Medical Imaging, C.
Taylor, J.A. Noble, (Eds.), Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2732,
(2003), 282-293.
[45] F. Wang, J. Wang, C. Zhang and J. Kwok, Face recognition using spectral features. Pattern Recognition,
40(10), (2007), 2786-2797.
[46] P. Luukka, Feature selection using fuzzy entropy measures with similarity classifier. Expert Systems
with Applications, 38(4), (2011), 4600-4607.
[47] J.A. Suykens, I. Horvath, S. Basu, C. Micchelli and J. Vandewalle (Eds.), Advances in learning theory:
methods, models, and applications, 190, (2003), IOS Press.
[48] S.A. Lashari and R. Ibrahim, A framework for medical images classification using soft set. Procedia
Technology, 11, (2013), 548-556.
[49] M. Bergtholdt, R. Wiemker and T. Klinder, Pulmonary nodule detection using a cascaded SVM
classifier. In SPIE International Society for Optics and Photonics, 9785, Medical Imaging 2016:
Computer-Aided Diagnosis, (2016), 978513.
[50] Y. Jiang, Z. Li, L. Zhang and P. Sun, An improved SVM classifier for medical image classification. In
International Conference on Rough Sets and Intelligent Systems Paradigms, M. Kryszkiewicz, J.F.
Peters, H. Rybinski, A. Skowron, (Eds.), Lecture Notes in Computer Science, Springer, Berlin,
Heidelberg, 4585, (2007), 764-773.
[51] Z. Camlica, H.R. Tizhoosh and F. Khalvati, Medical image classification via SVM using LBP features
from saliency-based folded data. In IEEE 14th International Conference on Machine Learning and
Applications, Miami, Florida, USA, (2015), 128-132.
[52] H.T. Huynh, and Y. Won, Hematocrit estimation from compact single hidden layer feedforward neural
networks trained by evolutionary algorithm. In IEEE World Congress on Computational Intelligence,
Hong Kong, China, (2008), 2962-2966.
[53] I.A. Yusoff, N.A.M. Isa, N.H. Othman, S.N. Sulaiman, and Y. Jusman, Performance of neural network
architectures: Cascaded MLP versus extreme learning machine on cervical cell image classification. In
IEEE 10th International Conference on Information Sciences Signal Processing and their Applications,
Kuala Lumpur, Malaysia, (2010), 308-311.
[54] J. Sánchez-Monedero, P.A. Gutiérrez, F. Fernández-Navarro and C. Hervás-Martínez, Weighting
efficient accuracy and minimum sensitivity for evolving multi-class classifiers. Neural Processing
Letters, 34(2), (2011), 101-116.
[55] E. Priya, S. Srinivasan and S. Ramakrishnan, Retrospective Non-Uniform Illumination Correction
Techniques in Images of Tuberculosis. Microscopy and Microanalysis, 20(05), (2014), 1382-1391.
[56] Y. Zeng, J. Zhang, J.L. Van Genderen, and Y. Zhang, Image fusion for land cover change detection.
International Journal of Image and Data Fusion, 1(2), (2010), 193-215.
[57] R. Susomboon, D. Raicu, J. Furst, and T.B. Johnson, A co-occurrence texture semi-invariance to
direction, distance, and patient size. In SPIE 6914, International Society for Optics and Photonics,
Medical Imaging 2008: Image Processing, (2008), 69141Y.
[58] C.E. Honeycutt and R. Plotnick, Image analysis techniques and gray-level co-occurrence matrices
(GLCM) for calculating bioturbation indices and characterizing biogenic sedimentary structures.
Computers & Geosciences, 34(11), (2008), 1461-1472.
[59] Z. Cai, X. Yan and Y. Liu, Advances in Computation and Intelligence. L. Kang (Ed.), (2008), Springer.
[60] C.H. Wei, C.T. Li and R. Wilson, A general framework for content-based medical image retrieval with
its application to mammograms. In Medical Imaging, International Society for Optics and Photonics,
(2005), 134-143.
[61] F. Albregtsen, Statistical texture measures computed from gray level coocurrence matrices. Image
processing laboratory, department of informatics, University of Oslo, 5, (2008).
[62] I. Ozdemir, D.A. Norton, U.Y. Ozkan, A. Mert and O. Senturk, Estimation of tree size diversity using
object oriented texture analysis and aster imagery. Sensors, 8(8), (2008), 4709-4724.
[63] C.H. Wei, C.T. Li and R. Wilson, A general framework for content-based medical image retrieval with
its application to mammograms. In Medical Imaging 2005: PACS and Imaging Informatics,
International Society for Optics and Photonics, (2005), 134-143.
[64] B.S. Anami and V.C. Burkpalli, Texture based identification and classification of bulk sugary food
objects. ICGST-GVIP Journal, 9(4), (2009), 9-14.
[65] A.L. Ion, Methods for Knowledge Discovery in Images. Information Technology and Control, 38(1),
(2009).
[66] A. Kuzmin, L. Korhonen, T. Manninen and M. Maltamo, Automatic Segment-Level Tree Species
Recognition Using High Resolution Aerial Winter Imagery. European Journal of Remote Sensing,
49(1), (2016), 239-259.
[67] D. Patra, and J. Mridula, Featured based segmentation of color textured images using glcm and markov
random field model. World Academy of Science, Engineering and Technology, 53(5), (2011), 108-113.
[68] S.D. Newsam and C. Kamath, Comparing shape and texture features for pattern recognition in
simulation data. In International Society for Optics and Photonics/SPIE's Annual Symposium on
Electronic Imaging, (2005), 106-117.
[69] E.M. van Rikxoort and E.L. van den Broek, Texture analysis. Graduate Research Proposal in AI, 15,
(2004).
[70] H. H. Hassan and S. Goussev, Texture Analysis of High Resolution Aeromagnetic Data to Identify
Geological Features in the Horn River Basin, NE British Columbia, (2011).
[71] K. Dong, Y. Feng, K.M. Jacobs, J.Q. Lu, R.S. Brock, L.V. Yang, F.E. Bertrand, M.A. Farwell and X.H.
Hu, Label-free classification of cultured cells through diffraction imaging. Biomedical optics express,
2(6), (2011), 1717-1726.
[72] C.C. Fagan, C.J. Du, C.P. O'Donnell, M. Castillo, C.D. Everard, D.J. O'Callaghan and F.A. Payne,
Application of Image Texture Analysis for Online Determination of Curd Moisture and Whey Solids in
a Laboratory Scale Stirred Cheese Vat. Journal of food science, 73(6), (2008), E250-E258.
[73] D. Gadkari, Image quality analysis using GLCM, (2004).
[74] I. Jolliffe, Principal component analysis. John Wiley & Sons, Ltd, (2002).
[75] D. Aguado, T. Montoya, L. Borras, A. Seco and & J. Ferrer, Using SOM and PCA for analysing and
interpreting data from a P-removal SBR. Engineering Applications of Artificial Intelligence, 21(6),
(2008), 919-930.
[76] V.N. Vapnik, The nature of statistical learning theory. Springer Verlag, New York, (1995).
[77] C.J. Burges, A tutorial on support vector machines for pattern recognition. Data mining and knowledge
discovery, 2(2), (1998), 121-167.
[78] N. Liu and H. Wang, Ensemble based extreme learning machine. IEEE Signal Processing Letters, 17(8),
(2010), 754-757.
[79] H.T. Huynh and Y. Won, Evolutionary algorithm for training compact single hidden layer feedforward
neural networks. In IEEE International Joint Conference on Neural Networks, (IEEE World Congress
on Computational Intelligence), Hong Kong, China, (2008), 3028-3033.
[80] C.H. Chen, Y.M. Chen, C.W. Lee, Y.J. Chang, C.Y. Cheng, and J.K Hung, Early diagnosis of spinal
tuberculosis. Journal of the Formosan Medical Association, 115(10), (2016), 825-836.
[81] J.C. Palomino, S.C Leão, and V. Ritacco, Tuberculosis 2007; from basic science to patient care (2007).
Deep Learning for Image Processing Applications 137
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-137
1. Introduction
Visual image retrieval aims at organizing and structuring image databases based
on their visual content. The proliferation of ubiquitous cameras in the last decade
has motivated researchers in the field to push the limits of visual search systems
with scalable yet effective solutions.
Representations based on convolutional neural networks (CNNs) have been
demonstrated to outperform the state-of-the-art in many computer vision tasks.
CNNs trained on large amounts of labeled data produce global representations
that effectively capture the semantics in images. Features from these networks
have been successfully used in various image retrieval benchmarks with very
promising results [1,2,3,4,5,6], improving upon the state-of-the-art compact image
representations for image retrieval.
Figure 1. Examples of the top-ranked images and localizations based on local CNN features encoded with BoW. Top row: the Christ Church from the Oxford Buildings dataset; middle row: the Sacre Coeur from Paris Buildings; bottom row: query 9098 (a parking sign) from TRECVid INS 2013.
Despite CNN-based descriptors performing remarkably well in instance search benchmarks like the Oxford and Paris Buildings datasets, state-of-the-art answers for more challenging datasets such as TRECVid Instance Search (INS)
have not yet adopted pipelines that depend solely on CNN features. Many INS
systems [7,8,9,10] are still based on aggregating local handcrafted features (like
SIFT) using the Bag of Words encoding [11] to produce very high-dimensional
sparse image representations. Such high-dimensional sparse representations have
several benefits over their dense counterparts. High dimensionality means they are more likely to be linearly separable, while having relatively few non-zero elements makes them efficient both in terms of storage (only non-zero components need to be stored) and computation (only non-zero elements need to be visited). Sparse representations can handle varying information content and are less likely to interfere with one another when pooled.
retrieval perspective, sparse representations can be stored in inverted indices,
which facilitates efficient selection of images that share features with a query.
Furthermore, there is considerable evidence that biological systems make extensive
use of sparse representations for sensory information [12,13]. Empirically, sparse representations have repeatedly been shown to be effective in a wide range of vision and machine learning tasks.
Many efficient image retrieval engines combine an initial highly scalable ranking
mechanism on the full image database with a more computationally expensive
yet higher-precision reranking scheme applied to the top retrieved items. This
reranking mechanism often takes the form of geometric verification and spatial
analysis [14,15,16,8], after which the best matching results can be used for query
expansion (pseudo-relevance feedback) [17,18].
In this chapter, inspired by advances in CNN-based descriptors for image
retrieval, yet still focusing on instance search, we revisit the Bag of Words encoding
scheme using local features from convolutional layers of a CNN. This work presents
the following contributions:
2. Related Work
Several other authors have proposed CNN-based representations for image retrieval.
The first applications focused on replacing traditionally handcrafted descriptors
with features from a pre-trained CNN for image classification. Activations from
the last fully connected layers of the AlexNet network proposed by Krizhevsky
were the first ones to be used as a generic image representation with potential
applications for image retrieval [19,20,21]. Similar images generate similar ac-
tivation vectors in the Euclidean space. This finding motivated early works in
studying the capability of CNN models for retrieval, mostly focused on the analysis of fully connected layers extracted from the pre-trained AlexNet classification model [2,3,22]. In this context, Babenko et al. [2] showed how such features could reach similar performance to handcrafted features encoded with Fisher vectors for image retrieval. Razavian et al. [3] later improved the state-of-the-art of CNN representations for retrieval by using several image sub-patches as input to a pre-trained CNN to extract features at different locations of the image. Similarly, Liu et al. [23] used features from fully connected layers evaluated on image sub-patches to encode images using Bag of Words.
Deep learning has proven to be a successful mechanism for learning useful semantic representations from data. However, most of the works discussed above use off-the-shelf CNN representations for the task of retrieval, where the representations have been implicitly learned as part of a classification task on the ImageNet dataset. This approach presents two main drawbacks. The first one comes from the source dataset, ImageNet, from which the features have been learned: while ImageNet is a large-scale classification dataset covering 1,000 diverse classes (from airplanes and landmarks to general objects) and allows models to learn good generic features, it has been explicitly designed to contain high intra-class invariance, which is not a desirable property for retrieval. The second drawback lies in the loss function used: categorical cross entropy evaluates the classification prediction without trying to discriminate between instances of the same class, a distinction that is desirable in many retrieval scenarios.
One simple yet effective solution to improve the capacity of the CNN features consists in learning representations better suited to the target retrieval dataset by fine-tuning the CNN to perform classification in a new domain. This approach was followed by Babenko et al. [2], who trained the AlexNet architecture to perform classification on the Landmarks dataset1, which is semantically closer to the target retrieval domain. Despite the improved performance, the final metric and the layers used at test time differed from the ones actually optimized during learning.
State-of-the-art CNN retrieval networks have been fine-tuned by optimizing a similarity loss function [35,4,26]. For that, the whole fine-tuning process of a CNN is cast as a metric learning problem, where the CNN represents an embedding function that maps the input image into a space where relative image similarities are preserved. Siamese and triplet networks are commonly used for this task.
L(i, j) = (1/2) [ Y(i, j) D(i, j) + (1 − Y(i, j)) max(0, α − D(i, j)) ] ,  (1)
where D(i, j) = ||f(i) − f(j)||2 represents the Euclidean distance between a pair of images and f is the embedding function (the CNN) that maps an image to a point in a Euclidean space. When a pair of images belongs to the same category, the loss directly reduces the distance between them in the feature space, whereas when the images are different the loss is composed of a hinge function that pushes apart those pairs whose distance does not reach the minimum margin α.
1 http://sites.skoltech.ru/compvision/projects/neuralcodes/
First introduced in 1994 for signature verification [39], Siamese networks have
been applied for dimensionality reduction [37], learning image descriptors [40,41,42]
or face verification [36,38].
The triplet loss has the form L(a, p, n) = max(0, D(a, p) − D(a, n) + α), where D(a, p) is the Euclidean distance between the anchor and a positive example, D(a, n) is the Euclidean distance between the anchor and a negative example, and α is a margin. The loss ensures that, given an anchor image, the distance between the anchor and a negative image is larger than the distance between the anchor and a positive image by at least the margin α.
The main difference between triplet and Siamese architectures is that the former optimizes relative distances with respect to a reference image or anchor, whereas the latter optimizes positive and negative pairs separately; optimizing relative distances usually leads to models with better performance [43,44].
Figure 2. Architectures for similarity learning. The baseline network is initialized with the weights of a state-of-the-art CNN for classification. The top layer can be seen as the aggregation step, based on: A) fully connected layers [53]; B) direct sum/max-pooling followed by feature post-processing [51]; C) region pooling followed by feature post-processing [35]; and D) a custom aggregation model such as VLAD [27].
Table 1. Summary of the discussed CNN approaches on different retrieval benchmarks. The table does not include approaches with spatial verification and re-ranking pipelines.
Method Dim Oxford5k Oxford105k Paris6k Paris106k Holidays
cedures that limit the generalization of the methods, usually related to landmark scenarios.
Despite the rapid advances of CNNs for image retrieval, many state-of-the-
art instance search systems are still based on Bag of Words encoding of local
handcrafted features such as SIFT [56]. The Bag of Words model has been enhanced
with sophisticated techniques such as query foreground/background weighting [57],
asymmetric distances [58], or larger vocabularies [45,10]. The most substantial
improvements, however, are those involving spatial reranking stages. Zhou et al. [9]
propose a fast spatial verification technique which benefits from the BoW encoding
to choose tentative matching feature pairs between the query and the target image.
Zhang et al. [8] introduce an elastic spatial verification step based on a triangulated graph model. Nguyen et al. [7] propose a solution based on deformable parts
models (DPMs) [59] to rerank a BoW-based baseline. They train a neural network
on several query features to learn the weights to fuse the DPM and BoW scores.
In our work, we revisit the Bag of Words encoding strategy and demonstrate its
suitability to aggregate CNN features of an off-the-shelf network, instead of using
handcrafted ones such as SIFT. Our method is unsupervised, does not require any additional training data, and is not restricted to any particular domain.
We also propose a simple and efficient spatial reranking strategy to allow query
expansion with local features. Although we do not use many of the well-known
improvements to the BoW pipeline for image search [57,45,58], we propose a
baseline system in which they could be easily integrated.
Publicly available image retrieval benchmarks such as the Oxford Buildings [45], Paris [46], Sculptures [60], INRIA Holidays [61] or Kentucky [62] datasets are relatively small collections used in the image retrieval community to test and compare different approaches to CBIR systems (see Table 2). Results are
reported in terms of mean Average Precision (mAP) of the list of images retrieved
from the dataset per query. Metrics such as memory or time required to conduct
the search are also important factors to take into account for most real-world
problems or when dealing with video, due to the large quantity of images in
the dataset. Some recent works reported excellent results using CNNs as image
representations [4,35,51,53].
Table 2. Datasets for image retrieval benchmarking. The table shows the total number of images
and queries of each dataset as well as the domain of search.
Figure 3. Samples from some retrieval benchmarks. First row, Sculptures; second row Paris;
third and fourth rows Kentucky.
However, the kind of queries for these datasets (see Figure 3 for some examples)
can be considered simpler than the queries in a generic instance search. In those
datasets, the objects are usually the main part of the image and topics to retrieve
are usually restricted to a particular domain.
The TRECVid benchmark [63] pushes the limits of visual retrieval by providing a large data collection and a uniform scoring procedure for evaluation. The Instance Search task in TRECVid consists of finding 30 particular instances within 464 hours of video (a total of 224 video files, 300GB). For each query, 4 image examples are provided. A common procedure to deal with videos is to perform key frame extraction. For example, in our 2014 participation [64], the image dataset contained 647,628 key frames (66GB), obtained by extracting 0.25 frames per second from the videos (which can be considered a low rate for key frame extraction).
Figure 4. Query examples for the TRECVid Instance Search task [63]. Instance queries are diverse: they can be logos, objects, buildings or people, and the location and scale of the instances also vary.
There is no ground truth for the query images within the dataset and the
number of examples per query is limited to 4 frames. Figure 4 shows an example
of some of the 2014 TRECVid Instance Search queries.
Although promising results have shown the power of CNN representations on different retrieval benchmarks, results have mainly been reported for relatively small datasets, which are not sufficiently representative in terms of generalization or query complexity for real-world problems such as instance search in videos. Approaches that work well on small datasets may not work well on larger and more realistic datasets such as TRECVid.
The proposed pipeline for feature extraction uses the activations at different
locations of a convolutional layer in a pre-trained CNN as local features. A CNN
trained for a classification task is typically composed of a series of convolutional
layers, followed by some fully connected layers, connected to a softmax layer that
produces the inferred class probabilities. To obtain a fixed-sized output, the input
image to a CNN is usually resized to be square. However, several authors using
CNNs for retrieval [5,24] have reported performance gains by retaining the aspect
ratio of the original images. We therefore discard the softmax and fully connected
layers of the architecture and extract CNN features maintaining the original image
aspect ratio.
Each convolutional layer in the network has D different N × M feature
maps, which can be viewed as N × M descriptors of dimension D. Each of these
descriptors contains the activations of all neurons in the convolutional layer sharing
the same receptive field. This way, these D-dimensional features can be seen as
local descriptors computed over the region corresponding to the receptive field of
an array of neurons. With this interpretation, we can treat the CNN as a local
feature extractor and use any existing aggregation technique to build a single
image representation.
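As a concrete illustration of this interpretation, the following is a minimal sketch (in PyTorch rather than the Caffe setup used in this chapter) of how the activations of a convolutional layer can be read as N × M local descriptors of dimension D while preserving the image aspect ratio. The layer index for conv5_1 and the 1/3 rescaling factor reflect the configuration described later in the chapter; all names are illustrative.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision import models
from PIL import Image

# VGG16 pre-trained on ImageNet; keep only the convolutional part.
vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()
# Index 25 is the ReLU right after conv5_1 in torchvision's layer ordering (assumption).
conv5_1 = torch.nn.Sequential(*list(vgg.children())[:26])

def local_features(image_path, scale=1.0 / 3.0):
    """Return an (N*M, D) array of local descriptors and the (N, M) map shape."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    # Rescale keeping the original aspect ratio (no square resize).
    img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    x = TF.normalize(TF.to_tensor(img),
                     mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        fmap = conv5_1(x.unsqueeze(0))[0]          # D x N x M feature maps
    d, n, m = fmap.shape
    # Each spatial position becomes one D-dimensional local descriptor.
    return fmap.permute(1, 2, 0).reshape(n * m, d).numpy(), (n, m)
```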
We propose to use the BoW model to encode the local convolutional features
of an image into a single vector. Although more elaborate aggregation strategies
have been shown to outperform BoW-based approaches for some tasks in the
literature [26,65], BoW encodings produce sparse high-dimensional codes that can
be stored in inverted indices, which are beneficial for fast retrieval. Moreover, BoW-
based representations are faster to compute, easier to interpret, more compact,
and provide all the benefits of sparse high-dimensional representations previously
mentioned in Section 1.
BoW models require constructing a visual codebook to map vectors to their
nearest centroid. We use k-means on local CNN features to fit this codebook. Each
local CNN feature in the convolutional layer is then assigned its closest visual
word in the learned codebook. This procedure generates the assignment map, i.e.
a 2D array of size N × M that relates each local CNN feature with a visual word.
The assignment map is, therefore, a compact representation of the image, which relates each pixel of the original image with its visual word, with a precision of W/N × H/M pixels, where W and H are the width and height of the original image.
This property allows us to quickly generate the BoW vectors of not only the full
image, but also its parts. We describe the use of this property in our work in
Section 5.
Figure 6 shows the pipeline of the proposed approach. The described bag
of local convolutional features (BLCF) encodes the image into a sparse high
dimensional descriptor, which will be used as the image representation for retrieval.
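A compact sketch of this encoding under illustrative assumptions (scikit-learn's MiniBatchKMeans stands in for whatever k-means implementation was actually used): the codebook is fit on stacked local features, each feature is mapped to its nearest centroid to form the assignment map, and the BoW vector of any (sub-)region is a sparse histogram of its visual words.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans

def fit_codebook(local_feats, k=25000, seed=0):
    """Fit a visual codebook with k centroids on stacked local CNN features."""
    return MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=10000).fit(local_feats)

def assignment_map(local_feats, shape, codebook):
    """Assign every local feature to its nearest visual word; reshape to N x M."""
    words = codebook.predict(local_feats)
    return words.reshape(shape)                    # 2D array of visual word ids

def bow_vector(assign_map, k):
    """Sparse BoW histogram (1 x k) of the visual words in a (sub-)region."""
    counts = np.bincount(assign_map.ravel(), minlength=k)
    return csr_matrix(counts, dtype=np.float32)
```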
5. Image Retrieval
This section describes the image retrieval pipeline, which consists of an initial
ranking stage, followed by a spatial reranking, and query expansion.
(a) Initial search: The initial ranking is computed using the cosine similarity between the BoW vector of the query image and the BoW vectors of the full images in the database. We use a sparse-matrix-based inverted index and GPU-based sparse matrix multiplications to allow fast retrieval, and the image list is sorted by its cosine similarity to the query (a minimal sketch of this ranking step follows the list below). We use two types of image search based on the query information that is used:
• Global search (GS): The BoW vector of the query is built with the visual
words of all the local CNN features in the convolutional layer extracted for
the query image.
• Local search (LS): The BoW vector of the query contains only the visual
words of the local CNN features that fall inside the query bounding box.
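A minimal sketch of this initial ranking, assuming the sparse BoW vectors built above: L2-normalizing the rows makes a sparse dot product equal to the cosine similarity, so a single sparse matrix-vector product ranks the whole database. Function and variable names are illustrative.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.preprocessing import normalize

def rank_database(query_bow, database_bows):
    """Return database indices sorted by cosine similarity to the query.

    query_bow     : 1 x k sparse BoW vector (GS or LS variant).
    database_bows : list of 1 x k sparse BoW vectors, one per image.
    """
    index = normalize(vstack(database_bows))     # inverted-index-like sparse matrix
    q = normalize(query_bow)
    scores = (index @ q.T).toarray().ravel()     # sparse dot products = cosine sims
    order = np.argsort(-scores)
    return order, scores[order]
```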
(b) Local reranking (R): After the initial search, the top T images in the
ranking are locally analyzed and reranked based on a localization score. We
choose windows of all possible combinations of width w ∈ {W, W/2, W/4} and height h ∈ {H, H/2, H/4}, where W and H are the width and height of the assignment map. We use a sliding window strategy directly on the assignment map with 50% overlap in both directions.
We additionally perform a simple filtering strategy to discard those windows whose aspect ratio is too different from the aspect ratio of the query. Let ARq = Wq/Hq be the aspect ratio of the query bounding box and ARw = Ww/Hw the aspect ratio of the window. The score for window w is defined as scorew = min(ARw, ARq) / max(ARw, ARq). All windows with a score lower than a threshold th are discarded.
For each of the remaining windows, we construct the BoW vector represen-
tation and compare it with the query representation using cosine similarity. The
window with the highest cosine similarity is taken as the new score for the image
(score max pooling).
We also enhance the BoW window representation with spatial pyramid match-
ing [66] with L = 2 resolution levels (i.e. the full window and its 4 sub regions).
We construct the BoW representation of all sub regions at the 2 levels, and weight
their contribution to the similarity score with inverse proportion to the resolution
level of the region. The cosine similarity of a sub region r to the corresponding
1
query sub region is therefore weighted by wr = 2(L−l r)
, where lr is the resolution
level of the region r. Figure 7 depicts the described approach.
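A compact sketch of this reranking step under the assumptions above (dense BoW vectors, no spatial pyramid term, illustrative names): candidate windows are generated over the assignment map, filtered by aspect ratio, encoded as BoW histograms, scored by cosine similarity, and the best window provides both the new image score and a rough localization.

```python
import numpy as np

def candidate_windows(W, H, overlap=0.5):
    """Sliding windows over a W x H assignment map, sides in {1, 1/2, 1/4} of the map."""
    wins = []
    for w in (W, W // 2, W // 4):
        for h in (H, H // 2, H // 4):
            if w < 1 or h < 1:
                continue
            sx, sy = max(1, int(w * (1 - overlap))), max(1, int(h * (1 - overlap)))
            for x in range(0, W - w + 1, sx):
                for y in range(0, H - h + 1, sy):
                    wins.append((x, y, w, h))
    return wins

def rerank_image(assign_map, query_bow, query_wh, k, th=0.4):
    """Score one image: max cosine similarity over windows passing the aspect-ratio filter."""
    H, W = assign_map.shape
    ar_q = query_wh[0] / query_wh[1]
    qv = query_bow / (np.linalg.norm(query_bow) + 1e-12)
    best, best_win = 0.0, None
    for (x, y, w, h) in candidate_windows(W, H):
        ar_w = w / h
        if min(ar_w, ar_q) / max(ar_w, ar_q) < th:      # aspect-ratio filter
            continue
        hist = np.bincount(assign_map[y:y + h, x:x + w].ravel(),
                           minlength=k).astype(np.float32)
        sim = float(hist @ qv) / (np.linalg.norm(hist) + 1e-12)
        if sim > best:                                   # score max pooling
            best, best_win = sim, (x, y, w, h)
    return best, best_win                                # score + rough localization
```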
With this procedure, the top T elements of the ranking are re-sorted based on the cosine similarity of their best-matching regions to the query's; the procedure also provides the region with the highest score as a rough localization of the object.
(c) Query expansion: We investigate two query expansion strategies based on
global and local BoW descriptors:
• Global query expansion (GQE): The BoW vectors of the N images at the top of the ranking are averaged together with the BoW of the query to form the new representation for the query (see the sketch after this list). GQE can be applied either before or after the local reranking stage.
• Local query expansion (LQE): Locations obtained in the local reranking
step are used to mask out the background and build the BoW descriptor of
only the region of interest of the N images at the top of the ranking. These
BoW vectors are averaged together with the BoW of the query bounding
box. The resulting BoW vector is used to perform a second search.
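A minimal sketch of GQE under the same assumptions (illustrative names); LQE is the same operation applied to BoW vectors restricted to the localized regions.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.preprocessing import normalize

def global_query_expansion(query_bow, database_bows, ranking, n=10):
    """Average the query BoW with the BoW of the top-n ranked images (GQE)."""
    top = [database_bows[i] for i in ranking[:n]]
    expanded = vstack([query_bow] + top).mean(axis=0)   # dense 1 x k average
    return normalize(np.asarray(expanded))              # new query representation
```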
6. Experiments
6.1. Datasets
Figure 8. Query examples from the three different datasets. Top: Paris buildings (1-3) and Oxford
buildings (4-6); bottom: TRECVid INS 2013.
The TRECVid INS 2013 dataset defines 30 queries and provides 4 still images for each of them (including a binary mask of the object location). In our experiments, we use a subset of this dataset that contains only those key frames that are positively annotated for at least one of the queries. The dataset, which we will refer to as the TRECVid INS subset, is composed of 23,614 key frames.
Figure 8 includes three examples of query objects from the three datasets.
Feature extraction was performed using Caffe [67] and the VGG16 pre-trained
network [68]. We extracted features from the last three convolutional layers (conv5_1, conv5_2 and conv5_3) and compared their performance on the Oxford
5k dataset. We experimented with different image input sizes: 1/3 and 2/3 of the
original image. Following several other authors [1,24], we L2-normalize all local features, apply PCA whitening, and perform a second round of L2-normalization.
The PCA models were fit on the same dataset as the test data in all cases.
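A small sketch of this post-processing chain with scikit-learn; the number of retained components is an assumption, since it is not stated here, and names are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

def fit_postprocessing(local_feats, n_components=512):
    """Fit the PCA-whitening model on L2-normalized local CNN features."""
    pca = PCA(n_components=n_components, whiten=True)
    pca.fit(normalize(local_feats))
    return pca

def postprocess(local_feats, pca):
    """L2-normalize, PCA-whiten, and L2-normalize again (L2-PCA-L2)."""
    return normalize(pca.transform(normalize(local_feats)))
```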
Unless stated otherwise, all experiments used a visual codebook of 25,000 centroids fit on the (L2-PCA-L2 transformed) local CNN features of all images in the same dataset (1.7M and 2.15M features for Oxford 5k and Paris 6k, respectively). We tested three different codebook sizes (25,000; 50,000 and 100,000) on the Oxford 5k dataset and chose 25,000 centroids because of their higher performance.
Table 3 shows the mean average precision on Oxford 5k for the three different
layers and image sizes. We also consider the effect of applying bilinear interpolation
of the feature maps prior to the BoW construction, as a fast alternative to using
a larger input to the CNN. Our experiments show that all layers benefit from
feature map interpolation. Our best result was achieved using conv5_2 with full-size images as input. However, we discarded this configuration due to its memory requirements: on an Nvidia GeForce GTX 970, we found that feature extraction on images rescaled by a factor of 1/3 was 25 times faster than on images twice that size. For this reason, we resize all images to 1/3 of their original size and use conv5_1 interpolated feature maps.
Table 3. Mean average precision (mAP) on Oxford 5k using different convolutional layers of VGG16, comparing the performance of different feature map resolutions (both raw and interpolated). The size of the codebook is 25,000 in all experiments.
w(i, j) = 1 / ( (i − c1)^2 + (j − c2)^2 ) ,  (3)
where (i, j) represents the position of a visual word within the assignment map and (c1, c2) are the center coordinates of the assignment map. w(i, j) is min-max normalized to provide scores between 0 and 1. All results are obtained using this weighting criterion, which, for conv5_1 on Oxford 5k, increases mAP from 0.626 to 0.653.
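A short sketch of this center prior, following Equation (3) as reconstructed above; a small epsilon is added to avoid division by zero at the exact center, and names are illustrative.

```python
import numpy as np

def center_prior(n, m, eps=1e-6):
    """Weight map of Equation (3), min-max normalized to [0, 1]."""
    c1, c2 = (n - 1) / 2.0, (m - 1) / 2.0
    i, j = np.mgrid[0:n, 0:m]
    w = 1.0 / ((i - c1) ** 2 + (j - c2) ** 2 + eps)   # eps avoids division by zero
    return (w - w.min()) / (w.max() - w.min())

def weighted_bow(assign_map, weights, k):
    """BoW that accumulates w(i, j) instead of 1 for each visual word."""
    return np.bincount(assign_map.ravel(), weights=weights.ravel(), minlength=k)
```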
Previous works [17,69] have demonstrated how simple data augmentation strategies
can improve the performance of an instance search system. Some of these apply
augmentation strategies at the database side, which can be prohibitively costly
Table 5. mAP on Oxford 5k for the two different types of query augmentation: the flip and the zoomed central crop (ZCC). 2× interpolated conv5_1 features are used in all cases. Rows correspond to the query alone and its combinations with the flip and ZCC transformations; the reported mAP values are 0.653, 0.640, 0.645 and 0.671.
for large datasets. For this reason, we use data augmentation on the query side
only. We explore two different strategies to enrich the query before visual search:
a horizontal flip (or mirroring) and a zoomed central crop (ZCC) on an image
enlarged by 50%.
Figure 9 shows an example of the transformations, which give rise to 4 different
versions of the query image. The feature vectors they produce are added together
to form a single BoW descriptor. Table 5 shows the impact of incrementally
augmenting the query with each one of these transformations.
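A small sketch of the two query-side transformations (PIL-based, illustrative names); the BoW descriptors of the four resulting versions are simply summed into a single query vector.

```python
from PIL import Image

def augmented_queries(img):
    """Return the 4 query versions: original, flip, ZCC, and flipped ZCC."""
    w, h = img.size
    zoomed = img.resize((int(w * 1.5), int(h * 1.5)), Image.BILINEAR)
    zw, zh = zoomed.size
    left, top = (zw - w) // 2, (zh - h) // 2
    zcc = zoomed.crop((left, top, left + w, top + h))   # zoomed central crop
    flip = img.transpose(Image.FLIP_LEFT_RIGHT)
    return [img, flip, zcc, zcc.transpose(Image.FLIP_LEFT_RIGHT)]

# The BoW descriptors of the 4 versions are added into a single query descriptor, e.g.:
# query_bow = sum(bow_vector(assignment_map(*local_features(v), codebook), k)
#                 for v in augmented_queries(query_image))
```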
We find that all the studied types of query augmentation consistently improve
the results, for both global and local search. ZCC provides a higher gain in perfor-
mance compared to flipping alone. ZCC generates an image of the same resolution
as the original, which contains the center crop at a higher resolution. Objects
from the Oxford dataset tend to be centered, which explains the performance gain
when applying ZCC.
Table 6. mAP on Oxford 5k and Paris 6k for the different stages in the pipeline introduced in Section 5. The additional +Qaug columns indicate the results when the query is augmented with the transformations introduced in Section 6.3.
We apply the local reranking (R) stage to the top-100 images in the initial ranking, using the sliding window approach described in Section 5. The presented aspect ratio filtering is applied with a threshold th = 0.4, which was chosen based
on a visual inspection of results on a subset of Oxford 5k. Query expansion is
later applied considering the top-10 images of the resulting ranking. This section
evaluates the impact in the performance of both reranking and query expansion
stages. Table 6 contains the results for the different stages in the pipeline for both
simple and augmented queries (referred to as Qaug in the table).
The results indicate that the local reranking is only beneficial when applied
to a ranking obtained from a search using the global BoW descriptor of the query
image (GS). This is consistent with the work by Tolias et al. [5], who also apply a
spatial reranking followed by query expansion to a ranking obtained with a search
using descriptors of full images. They achieve a mAP of 0.66 in Oxford 5k, which
is increased to 0.77 after spatial reranking and query expansion, while we reach
similar results (e.g. from 0.652 to 0.769). However, our results indicate that a
ranking originating from a local search (LS) does not benefit from local reranking.
Since the BoW representation allows us to effectively perform a local search (LS)
in a database of fully indexed images, we find the local reranking stage applied to
LS to be redundant in terms of the achieved quality of the ranking. However, the
local reranking stage does provide with a rough localization of the object in the
images of the ranking, as depicted in Figure 1. We use this information to perform
query expansion based on local features (LQE).
Results indicate that query expansion stages greatly improve performance
in Oxford 5k. We do not observe significant gains after reranking and QE in the
Paris 6k dataset, although we achieve our best result with LS + R + LQE.
In the case of augmented queries (+Qaug ), we find that this query expansion
is less helpful in all cases, which suggests that the information gained with query
augmentation and the one obtained by means of query expansion strategies are
not complementary.
Table 7. Comparison to state-of-the-art CNN representations (mAP). Results in the lower section
consider reranking and/or query expansion.
Oxford Paris
5k 105k 6k 106k
Ng et al. [25] 0.649 - 0.694 -
Razavian et al. [4] 0.844 - 0.853 -
SPoC [1] 0.657 0.642 - -
R-MAC [5] 0.668 0.616 0.830 0.757
CroW [24] 0.682 0.632 0.796 0.710
uCroW [24] 0.666 0.629 0.767 0.695
GS 0.652 0.510 0.698 0.421
LS 0.739 0.593 0.820 0.648
LS + Qaug 0.758 0.622 0.832 0.673
CroW + GQE [24] 0.722 0.678 0.855 0.797
R-MAC + R + GQE [5] 0.770 0.726 0.877 0.817
LS + GQE 0.773 0.602 0.814 0.632
LS + R + LQE 0.788 0.651 0.848 0.641
LS + R + GQE + Qaug 0.793 0.666 0.828 0.683
We compare our approach with other CNN-based representations that make use
of features from convolutional layers on the Oxford and Paris datasets. Table 7
includes the best result for each approach in the literature. Our performance using
global search (GS) is comparable to that of Ng et al. [25], which is the one that
most resembles our approach. However, they achieve this result using raw Vector
of Locally Aggregated Descriptors (VLAD) features, which are more expensive to
compute and, being a dense high-dimensional representation, do not scale as well to
larger datasets. Similarly, Razavian et al. [4] achieve the highest performance of all
approaches in both the Oxford and Paris benchmarks by applying a spatial search
at different scales for all images in the database. Such an approach is prohibitively
costly when dealing with larger datasets, especially for real-time search scenarios.
Our BoW-based representation is highly sparse, allowing for fast retrieval in large
datasets using inverted indices, and achieves consistently high mAP in all tested
datasets.
We find the usage of the query bounding box to be extremely beneficial in
our case for both datasets. The authors of SPoC [1] are the only ones who report
results using the query bounding box for search, finding a decrease in performance
from 0.589 to 0.531 using raw SPoC features (without center prior). This suggests
that sum pooled CNN features are less suitable for instance level search in datasets
where images are represented with global descriptors.
We also compare our local reranking and query expansion results with similar
approaches in the state-of-the-art. The authors of R-MAC [5] apply a spatial search
for reranking, followed by a query expansion stage, while the authors of CroW [24]
only apply query expansion after the initial search. Our proposed approach also
achieves competitive results in this section, achieving the best result for Oxford
5k.
In this section, we compare the BLCF with the sum pooled convolutional features
proposed in several works in the literature. We use our own implementation of the
uCroW descriptor from [24] and compare it with BLCF for the TRECVid INS
subset. For the sake of comparison, we test our implementation of sum pooling
using both our chosen CNN layer and input size (conv5_1 and 1/3 image size),
and the ones reported in [24] (pool5 and full image resolution). For the BoW
representation, we train a visual codebook of 25,000 centroids using 3M local CNN
features chosen randomly from the INS subset. Since the nature of the TRECVid
INS dataset significantly differs from that of the other ones used so far (see
Figure 8), we do not apply the center prior to the features in any case, to avoid down-weighting local features from image areas where the objects might appear. Table 8
compares sum pooling with BoW in Oxford, Paris, and TRECVid subset datasets.
As stated in earlier sections, sum pooling and BoW have similar performance
in Oxford and Paris datasets. For the TRECVid INS subset, however, Bag of
Words significantly outperforms sum pooling, which demonstrates its suitability
for challenging instance search datasets, in which queries are not centered and
have variable size and appearance. We also observe a different behavior when using
the provided query object locations (LS) to search, which was highly beneficial
in Oxford and Paris datasets, but does not provide any gain in TRECVid INS.
We hypothesize that this drop in performance is caused by the instances being much smaller in TRECVid than in the Paris and Oxford datasets.
Global search (GS) achieves better results on TRECVid INS, which suggests
that query instances are in many cases correctly retrieved due to their context.
Figure 12 shows the mean average precision of the different stages of the pipeline
for all TRECVid queries separately. The global search significantly outperforms
local search for most queries in the database. Figure 11 shows examples of the
queries for which the local search outperforms the global search. Interestingly, we
find these particular objects to appear in different contexts in the database. In
these cases, the usage of the local information is crucial to find the query instance
in unseen environments. For this reason, we compute the distance map of the
binary mask of the query, and assign a weight to each position of the assignment
map with inverse proportion to its value in the distance map. This way, higher
weights are assigned to the visual words of local CNN features near the object. We
find this scheme, referred to as weighted search (WS), to be beneficial for most of
the queries, suggesting that, although context is necessary, emphasizing the object
information in the BoW descriptor is beneficial.
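A small sketch of this weighting scheme, assuming SciPy's Euclidean distance transform and an illustrative resizing of the pixel-level distance map to the assignment map resolution.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, zoom

def ws_weights(query_mask, map_shape):
    """Per-position weights for weighted search (WS) from the binary query mask."""
    # Distance of every pixel to the closest object (foreground) pixel.
    dist = distance_transform_edt(query_mask == 0)
    # Resize the pixel-level distance map to the assignment map resolution.
    n, m = map_shape
    dist = zoom(dist, (n / dist.shape[0], m / dist.shape[1]), order=1)
    w = 1.0 / (1.0 + dist)                       # inverse proportion to the distance
    return (w - w.min()) / (w.max() - w.min() + 1e-12)
```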
We finally apply the local reranking and query expansion stages introduced
in Section 5 to the baseline rankings obtained for the TRECVid INS subset. Since
we are dealing with objects whose appearance can significantly change in different
keyframes, we decided not to filter out windows based on aspect ratio similarity.
Additionally, we do not apply the spatial pyramid matching, since some of the
query instances are too small to be divided in sub regions. After reranking, we
apply the distance map weighting scheme to the locations obtained for the top 10
images of the ranking and use them to do weighted query expansion (WQE).
Results are consistent with those obtained in the experiments in Oxford
and Paris datasets: although the local reranking does not provide significant
Figure 10. Appearances of the same object in different frames of TRECVid Instance Search.
Figure 11. Top 5 rankings for queries 9072 (top) and 9081 (bottom) of the TRECVid INS 2013
dataset.
Figure 12. Average precision per TRECVid INS query for the different stages of the pipeline: GS, LS, WS, WS + GQE and WS + R + WQE.
Table 8. mAP of sum pooling and BoW aggregation techniques in Oxford, Paris and TRECVid
INS subset.
7. Conclusion
References
[1] A. Babenko and V. Lempitsky. Aggregating local deep features for image retrieval. In
Proceedings of the IEEE international conference on computer vision, pages 1269–1277,
2015.
[2] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval.
In Computer Vision–ECCV 2014, pages 584–599. 2014.
[3] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf:
an astounding baseline for recognition. In Computer Vision and Pattern Recognition
Workshops, 2014.
[4] A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki. Visual instance retrieval with deep convolutional networks. ITE Transactions on Media Technology and Applications, 4(3):251–258, 2016.
[5] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of
cnn activations. In International Conference on Learning Representations, 2016.
[6] L. Xie, R. Hong, B. Zhang, and Q. Tian. Image classification and retrieval are one. In
Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pages
3–10. ACM, 2015.
[7] V. Nguyen, D. Nguyen, M. Tran, D. Le, D. Duong, and S. Satoh. Query-adaptive late
fusion with neural network for instance search. In International Workshop on Multimedia
Signal Processing (MMSP), pages 1–6, 2015.
[8] W. Zhang and C. Ngo. Topological spatial verification for instance search. IEEE Transac-
tions on Multimedia, 17(8):1236–1247, 2015.
[9] X. Zhou, C. Zhu, Q. Zhu, S. Satoh, and Y. Guo. A practical spatial re-ranking method
for instance search from videos. In International Conference on Image Processing (ICIP),
2014.
[10] C. Zhu and S. Satoh. Large vocabulary quantization for searching instances from videos.
In International Conference on Multimedia Retrieval (ICMR), 2012.
[11] J. Sivic and A. Zisserman. Efficient visual search of videos cast as text retrieval. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 31(4):591–606, 2009.
[12] P. Lennie. The cost of cortical computation. Current biology, 13(6):493–497, 2003.
[13] W. E. Vinje and J. L. Gallant. Sparse coding and decorrelation in primary visual cortex
during natural vision. Science, 287(5456):1273–1276, 2000.
[14] H. Jégou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search.
International Journal of Computer Vision, 87(3):316–336, 2010.
[15] Y. Zhang, Z. Jia, and T. Chen. Image retrieval with geometry-preserving visual phrases.
In Computer Vision and Pattern Recognition (CVPR), pages 809–816, 2011.
[16] T. Mei, Y. Rui, S. Li, and Q. Tian. Multimedia search reranking: A literature survey.
ACM Computing Surveys (CSUR), 46(3):38, 2014.
[17] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object
retrieval. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages
2911–2918, June 2012.
[18] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query
expansion with a generative feature model for object retrieval. In IEEE International
Conference on Computer Vision, Rio de Janeiro, Brazil, October 2007.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convo-
lutional neural networks. In Advances in neural information processing systems, pages
1097–1105, 2012.
[20] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf:
A deep convolutional activation feature for generic visual recognition. In Proceedings of
the 31st International Conference on International Conference on Machine Learning -
Volume 32, ICML’14, pages I–647–I–655. JMLR.org, 2014.
[21] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural
networks? In Proceedings of the 27th International Conference on Neural Information
Processing Systems, NIPS’14, pages 3320–3328, Cambridge, MA, USA, 2014. MIT Press.
[22] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Computer Vision–ECCV 2014. 2014.
[23] Y. Liu, Y. Guo, S. Wu, and M. Lew. Deepindex for accurate and efficient image retrieval.
In International Conference on Multimedia Retrieval (ICMR), 2015.
[24] Y. Kalantidis, C. Mellina, and S. Osindero. Cross-dimensional weighting for aggregated
deep convolutional features. In European Conference on Computer Vision VSM Workshop,
pages 685–701. Springer International Publishing, 2016.
[25] J. Yue-Hei Ng, F. Yang, and L. S. Davis. Exploiting local features from deep networks for
image retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, June 2015.
[26] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact
image representation. In 2010 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pages 3304–3311, June 2010.
[27] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnn architecture
for weakly supervised place recognition. In IEEE Conference on Computer Vision and
Pattern Recognition, 2016.
[28] A. Jimenez, J. M. Alvarez, and X. Giro-i Nieto. Class-weighted convolutional features for
visual instance search. In 28th British Machine Vision Conference (BMVC), September
2017.
[29] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for
discriminative localization. In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016.
[30] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale
hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR
2009. IEEE Conference on, pages 248–255, June 2009.
[31] C. Reyes, E. Mohedano, K. McGuinness, N. O’Connor, and X. Giro-i-Nieto. Where is my phone? Personal object retrieval from egocentric images. In Proceedings of the first Workshop on Lifelogging Tools and Applications, pages 55–62. ACM, 2016.
[32] J. Pan, E. Sayrol, X. Giro-i Nieto, K. McGuinness, and N. O’Connor. Shallow and deep
convolutional networks for saliency prediction. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 598–606, 2016.
[33] A. Salvador, X. Giro-i Nieto, F. Marques, and S. Shin’ichi. Faster R-CNN Features for
Instance Search. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, June 2016.
[34] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection
with region proposal networks. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 39(6):1137–1149, June 2017.
[35] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. End-to-end learning of deep visual
representations for image retrieval. International Journal of Computer Vision, 124(2):237–
254, 2017.
[36] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with
application to face verification. In 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), volume 1, pages 539–546 vol. 1, June 2005.
[37] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant
mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06), volume 2, pages 1735–1742, 2006.
[38] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level
performance in face verification. In Computer Vision and Pattern Recognition (CVPR),
2014 IEEE Conference on, pages 1701–1708, June 2014.
[39] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a “siamese” time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1994.
[40] N. Carlevaris-Bianco and R. M Eustice. Learning visual feature descriptors for dynamic
lighting conditions. In Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ
International Conference on, pages 2769–2776. IEEE, 2014.
[41] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, and F. Moreno-Noguer. Fracking deep
convolutional image descriptors. arXiv preprint arXiv:1412.6537, 2014.
[42] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional
neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 4353–4361, 2015.
[43] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International
Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[44] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face
recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 815–823, 2015.
[45] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large
vocabularies and fast spatial matching. In 2007 IEEE Conference on Computer Vision
and Pattern Recognition, pages 1–8, June 2007.
[46] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, June 2008.
[67] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
[68] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[69] P. Turcot and D. Lowe. Better matching with fewer features: The selection of useful
features in large database recognition problems. In ICCV WS-LAVD workshop, 2009.
Hierarchical Object Detection with Deep Reinforcement Learning
M. Bellver Bueno et al.
Deep Learning for Image Processing Applications
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-164
1. Introduction
feature extraction. Finally, Section 6 presents the conclusions of this work and
proposes some future research lines.
2. Background
Reinforcement learning is a powerful tool that has been used in a wide range of
applications. According to [10], reinforcement learning is “a way of programming agents by reward and punishment without needing to specify how the task is to be achieved”. This agent is a decision-maker that interacts with the environment and
learns through trial-and-error. In the hierarchical object detection task addressed
in this chapter, the agent must learn how to find an object in the image by
iteratively deciding among different zoom-in alternatives. This agent must learn
based on the positive or negative feedback provided during training after all the
zoom-in steps are taken.
Reinforcement learning based on deep neural networks gained attention after the impressive results from DeepMind [15], who trained an agent to play Atari 2600 video games by observing only their screen pixels, achieving superhuman performance in several of them. Google DeepMind also trained a system that, for the first time, beat a professional player at the game of Go [22]. More specifically to
traditional computer vision tasks, reinforcement learning has been applied to learn
spatial glimpse policies for image classification [14,1], for captioning [26] or for
activity recognition [27]. It has also been applied to object detection in images [2], casting the problem as a Markov Decision Process (MDP), as our approach does. MDPs are defined in Section 3.1.
The traditional solutions for object detection are based on region proposals, such
as Selective Search [25], CPMC [3] or MCG [17], or other methods based on sliding
windows such as EdgeBoxes [29], or hierarchical partitions such as BPTs [20,6].
The extraction of such proposals was independent of the classifier that would
score and select which regions compose the final detection. These methods are computationally expensive because they rely on a large number of object proposals. The first approaches based on Convolutional Neural Networks, such as Fast R-CNN [7], then studied how to share convolutional computation among locations, since the extraction of features for the hypothesized objects had been identified as the bottleneck of object detection.
More recent proposals such as Faster R-CNN [19,21] and Mask R-CNN [8]
have achieved efficient and fast object detection by obtaining nearly cost-free region proposals that share full-image convolutional features with the detection network.
Directly predicting bounding boxes from an image is a difficult task, and for
this reason, approaches such as Faster R-CNN rely on a number of reference
boxes called anchors, which facilitate the task of predicting accurate bounding boxes by regressing these initial reference boxes. One key aspect of our approach is that the predefined hierarchy of regions plays a similar role to these reference boxes.
This section presents our Hierarchical Object Detection Model (HODM) trained
with reinforcement learning. We formulate the object detection problem as the
sequential decision process of a goal-oriented agent interacting with a visual
environment that is our image. At each time step, the agent should decide in
which region of the image to focus its attention so that it can find objects in a
few steps.
State: The state is composed of the descriptor of the current region and a memory vector. The type of descriptor defines the two models we compare in our work: the Image-Zooms model and the Pool45-Crops model. These two variations are explained in detail in Section 3. The memory vector of the state captures the last 4 actions that the agent has already performed in the search for an object. As the agent is learning a refinement of a bounding box, a memory vector that encodes the state of this refinement procedure is useful to stabilize the search trajectories. We encode each of the past 4 actions as a one-hot vector; as there are 6 different actions, presented in the following section, the memory vector has 24 dimensions. This type of memory vector was also used in [2].
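A minimal sketch of this memory vector (illustrative names): each of the last 4 actions is one-hot encoded over the 6 possible actions and the encodings are concatenated into a 24-dimensional vector.

```python
import numpy as np

N_ACTIONS, HISTORY = 6, 4

def memory_vector(past_actions):
    """One-hot encode the last HISTORY actions into an N_ACTIONS * HISTORY vector."""
    mem = np.zeros(N_ACTIONS * HISTORY, dtype=np.float32)
    for slot, action in enumerate(past_actions[-HISTORY:]):
        mem[slot * N_ACTIONS + action] = 1.0     # action is an integer in [0, 5]
    return mem
```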
Actions: There are two types of possible actions: movement actions that imply
a change in the current observed region, and the terminal action to indicate
that the object is found and that the search has ended. One particularity of our
system is that each movement action can only transfer the attention top-down
between regions from a predefined hierarchy. A hierarchy is built by defining five sub-regions over each observed bounding box: four quarters distributed as a 2x2 grid over the box and a central overlapping region. We have explored two variations of this basic 2x2 scheme: a first one with non-overlapped quarters (see Figure 1), and a second one with overlapped quarters (see Figure 2), in which the size of a sub-region is 3/4 of its ancestor. There are therefore five movement actions, each one associated with one of the yellow regions. If, on the other hand, the terminal action is selected, there is no movement and the final region is the one marked in blue.
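A small sketch of the five sub-regions spanned by one step of the hierarchy, for both the overlapped and non-overlapped variants; the (x, y, w, h) box convention and names are illustrative.

```python
def sub_regions(box, overlapped=True):
    """Five children of a bounding box: 4 corner regions (2x2) + 1 central region."""
    x, y, w, h = box
    f = 0.75 if overlapped else 0.5              # child side relative to its parent
    cw, ch = w * f, h * f
    return [
        (x, y, cw, ch),                                 # top-left
        (x + w - cw, y, cw, ch),                        # top-right
        (x, y + h - ch, cw, ch),                        # bottom-left
        (x + w - cw, y + h - ch, cw, ch),               # bottom-right
        (x + (w - cw) / 2, y + (h - ch) / 2, cw, ch),   # central region
    ]
```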
Reward: The reward functions used are the ones proposed by [2]. The reward
function for the movement actions can be seen in Equation 1 and the reward
function for the terminal action in Equation 2. Given a certain state s, a reward
is given to those actions that move towards a region b’ with a greater Intersection
Over Union (IoU) with the ground truth g than the region b considered at the
previous step. Otherwise, the actions are penalized. For the trigger action, the
reward is positive if the Intersection Over Union of the actual region b with the
ground truth is greater than a certain threshold τ , and negative otherwise. We
consider τ = 0.5, because it is the threshold for which a detection is considered
positive, and η is 3, as in [2].
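Following [2] and the description above, the movement and terminal rewards can be written as
R_move(s, s') = sign( IoU(b', g) − IoU(b, g) ),
R_terminal(s) = +η if IoU(b, g) ≥ τ, and −η otherwise,
where b and b' are the regions observed before and after the action and g is the ground-truth box.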
3.2. Q-learning
The reward of the agent depending on the chosen action a at state s is governed
by a function Q(s,a), that can be estimated with Q-learning. Based on Q(s,a), the
agent will choose the action that is associated to the highest reward. Q-learning
iteratively updates the action-selection policy using the Bellman equation (Equation 3), where s and a are the current state and action respectively, r is the immediate reward, and max_a' Q(s', a') represents the future reward from the next state s'. Finally, γ represents the
discount factor. In our work, we approximate the Q-function by a Deep Q-network
trained with Reinforcement Learning [15].
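In its standard form, the Q-learning update referred to above follows the target
Q(s, a) = r + γ max_a' Q(s', a'),
where s' is the state reached after executing action a in state s.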
We study two variations of our HOD-DRL model depending on how the visual features are extracted, referred to in the remainder of the chapter as Image-Zooms and Pool45-Crops. Figure 3 depicts the two variations together with the common reinforcement learning network.
We compare two models to extract the visual features that define the state
of our agent: the Image-Zooms model and the Pool45-Crops model. For the
Image-Zooms model, each region is resized to 224x224 and its visual descriptors
correspond to the feature maps from the Pool5 layer of VGG-16 [23]. For the Pool45-Crops model, the full-resolution image is forwarded through VGG-16 [23] up to the Pool5 layer. As in [7], we reuse the feature maps extracted from the whole image for all the regions of interest (ROIs) by pooling them (ROI pooling). As in SSD
[12], we choose which feature map to use depending on the scale of the region of
interest. In our case, we only work with the Pool4 and Pool5 layers, which are the last two pooling layers of VGG-16. Once we have a certain region of interest
from the hierarchy, we decide which feature map to use by comparing the scale of
the ROI and the scale of the feature map. For large objects, the algorithm will
select the deeper feature map, whereas for smaller objects a shallower feature map
is more adequate.
The two models for feature extraction result in a 7x7 feature map, which is fed to the common block of the architecture. The region descriptor and the
memory vector are the input of the Deep Q-network that consists of two fully
connected layers of 1024 neurons each. Each fully connected layer is followed by a
ReLU [16] activation function and is trained with dropout [24]. Finally, the output
layer corresponds to the possible actions of the agent, six in our case.
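A minimal PyTorch sketch of this Q-network; the dropout probability and initialization are assumptions, as they are not specified here.

```python
import torch
import torch.nn as nn

class DeepQNetwork(nn.Module):
    """Two fully connected layers of 1024 units with ReLU and dropout, 6 output Q-values."""
    def __init__(self, feat_dim=7 * 7 * 512, mem_dim=24, n_actions=6, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + mem_dim, 1024), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(1024, n_actions),
        )

    def forward(self, region_feats, memory):
        # region_feats: (B, 512, 7, 7) descriptor, memory: (B, 24) one-hot action history
        x = torch.cat([region_feats.flatten(1), memory], dim=1)
        return self.net(x)                      # (B, 6) Q-values, one per action
```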
Figure 3. Architecture of our HOD-DRL model in its two variations for feature extraction: Image-Zooms (upper branch) and Pool45-Crops (lower branch).
This section introduces the training details of the Q-network, which corresponds
to the deeper layers of our HOD-DRL model, as shown in Figure 3. This is the
only trained part of our model, as convolutional layers are used off-the-shelf from
the VGG-16 model.
Exploration-Exploitation: To train the deep Q-network with reinforcement learning we use an ε-greedy policy that starts at ε = 1 and decreases to ε = 0.1 in steps of 0.1. We therefore start with random actions, and at each epoch the agent takes decisions relying more on the already learnt policy. In order to help the agent learn the terminal action, which would be difficult to discover through random exploration, we force it each time the current region has an IoU > 0.5 with the ground truth. With this approach, we can accelerate the training. Notice that we always keep some exploration, so we do not get stuck in a local minimum.
Learning Trajectories: One fact that we detected while training was that we
should not impose which object of the image to look at first. At each time step,
the agent will focus on the object in the current region with the highest overlap
with its ground-truth. This way, it is possible then that the target object changes
during the top-down exploration.
Training Parameters: The weights for the Deep Q-network were initialized from
a normal distribution. For learning, we used the Adam optimizer [11] with a learning rate of 1e-6 to prevent the gradients from exploding. We trained each model for 50
epochs.
Experience Replay: As seen previously, the update of Bellman Equation 3 learns from transitions of the form (s, a, r, s'), also called experiences. Consecutive experiences in our algorithm are highly correlated, which could lead to inefficient and unstable learning, a traditional problem in Q-learning. One solution to make
the algorithm converge is collecting experiences and storing them in a replay
memory. Random minibatches from this replay memory are used to train the
network. We used an experience replay of 1,000 experiences and a batch size of
100.
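A minimal sketch of such a replay memory with the capacity and batch size stated above (illustrative names).

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=100):
        # Random minibatch; sampling breaks the correlation between consecutive experiences.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```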
Discount Factor: To perform well in the long run, the future rewards should also
be taken into account and not only the immediate ones. To do this, we use the
discounted reward from Bellman Equation 3 with a value of γ = 0.90. We set the
gamma high because we are interested in balancing the immediate and future
rewards.
5. Experiments
Our experiments on object detection have used images and annotations from the
PASCAL VOC dataset [5]. We trained our system on the trainval sets of 2007 and
2012, and tested it on the test set of 2007. We performed all the experiments for
just one class, the airplane category, and only considering pictures with the target
class category. This experiment allows us to study the behavior of our agent and
estimate the number of regions that must be analyzed to detect an object.
We present some qualitative results in Figure 4 to show how our agent behaves
on test images. These results are obtained with the Image-Zooms model with
overlapped regions, as this is the one that yields best results, as argued in the
following sections. We observed that for most images, the model successfully zooms
towards the object and completes the task in a few steps. As seen in the second,
third and fourth rows, with just two or three steps, the agent selects the bounding
box around the object. The agent also performs accurately when there are small
instances of objects, as seen in the first and last rows.
We will analyze the precision and recall curves for different trained models,
considering that an object is correctly detected when the Intersection over Union
(IoU) of its bounding box compared to the ground truth is over 0.5, as defined by
the Pascal VOC challenge [5].
The Precision-Recall curves are generated by ranking all regions analyzed by
the agent. The sorting is based on the reward estimated by the sixth neuron of the
Q-network, which corresponds to the action of considering the region as terminal.
Upper Bound and Random Baselines: Our results firstly include baseline and
upper bound references for a better analysis. As a baseline we have programmed
an agent that chooses random actions and detection scores at each time step. As
an upper bound, we have exploited the ground truth annotations to manually
guide our agent towards the region with the greatest IoU. The result of these
random baselines and upper bounds for each hierarchy type can be seen in Figure 5. It
is also important to notice that the best upper-bound option does not even achieve
a recall of 0.5. This poor performance is because more than half of the ground
truth objects do not fit with the considered region proposals, so they cannot be
detected in our framework.
Overlapped and Non-overlapped Regions: The results obtained with the upper
bound and baseline methods provide enough information to compare the overlapped
and non-overlapped schemes. The overlapped regions scheme is the one that
provides higher precision and recall values, both for the upper bound and the
random models. This superiority of the overlapped case can be explained by the
slower reduction of spatial scale with respect to the non-overlapped model: as
bounding box regions are larger due to the overlap, their division into equal-size
sub-regions also generates larger sub-regions. This also implies that the agent will
require more steps to reach a lower resolution, but this finer top-down exploration
proves beneficial in our experiments, as the chances of missing an object
during the descent are also lower.
Model Comparison: The Image-Zooms model and the Pool45-Crops model are
compared in Figure 6. Results clearly indicate that the Image-Zooms model
performs better than the Pool45-Crops model. We hypothesize that this loss
of performance is due to the loss of resolution resulting from the ROI-pooling
over Pool4 or Pool5 layers. While in the Image-Zooms model the 7x7 feature
maps of Pool5 have been computed directly from a zoom over the image, in the
Pool45-Crops model the region crops over Pool4 or Pool5 could be smaller than
7x7. While these cases would be upsampled to the 7x7x512 input tensor to the
deep Q-Net, the source feature maps of the region would be of lower resolution
than their counterparts in the Image-Zoom model.
Models at Different Epochs: We study the training of our Image-Zooms model by
plotting the Precision-Recall curves at different epochs in Figure 7. As expected,
we observe how the performance of the model improves with the epochs.
6. Conclusions
This paper has presented our HOD-DRL model as a deep reinforcement learning
solution for object detection. Our solution is characterized by a top-down
exploration of a hierarchy of image regions guided by a trained deep Q-network.
Acknowledgments
This work has been developed in the framework of the project BigGraph TEC2013-
43935-R, funded by the Spanish Ministerio de Economia y Competitividad and the
European Regional Development Fund (ERDF). This work has been supported
by the grant SEV2015-0493 of the Severo Ochoa Program awarded by the Spanish
Government, project TIN2015-65316 of the Spanish Ministry of Science and
Innovation, and contract 2014-SGR-1051 of the Generalitat de Catalunya. The Image
Processing Group at the UPC is a SGR14 Consolidated Research Group recognized
and sponsored by the Catalan Government (Generalitat de Catalunya) through its
AGAUR office. We gratefully acknowledge the support of NVIDIA Corporation
with the donation of the GeForce GTX Titan Z used in this work and the support
of BSC/UPC NVIDIA GPU Center of Excellence. We also want to thank all the
members of the X-theses group for their advice.
References
[1] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention.
arXiv preprint arXiv:1412.7755, 2014.
[2] J. C. Caicedo and S. Lazebnik. Active object localization with deep reinforcement learning.
In Proceedings of the IEEE International Conference on Computer Vision, pages 2488–2496,
2015.
[3] J. Carreira and C. Sminchisescu. Cpmc: Automatic object segmentation using constrained
parametric min-cuts. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
34(7):1312–1328, 2012.
[4] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep
neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 2147–2154, 2014.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual
object classes (voc) challenge. International journal of computer vision, 88(2):303–338,
2010.
[6] X. Giro and F. Marques. Detection of semantic objects using description graphs. In Image
Processing, 2005. ICIP 2005. IEEE International Conference on, volume 1, pages I–1201.
IEEE, 2005.
[7] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer
Vision, pages 1440–1448, 2015.
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. arXiv preprint
arXiv:1703.06870, 2017.
[9] Z. Jie, X. Liang, J. Feng, X. Jin, W. Lu, and S. Yan. Tree-structured reinforcement learning
for sequential object localization. In Advances in Neural Information Processing Systems,
pages 127–135, 2016.
[10] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey.
Journal of artificial intelligence research, 4:237–285, 1996.
[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[12] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd:
Single shot multibox detector. In European conference on computer vision, pages 21–37.
Springer, 2016.
[13] Y. Lu, T. Javidi, and S. Lazebnik. Adaptive object detection using adjacency and zoom
prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 2351–2359, 2016.
[14] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in
Neural Information Processing Systems, pages 2204–2212, 2014.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep
reinforcement learning. Nature, 518(7540):529–533, 2015.
[16] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines.
In Proceedings of the 27th International Conference on Machine Learning (ICML-10),
pages 807–814, 2010.
[17] J. Pont-Tuset, P. Arbeláez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial
grouping for image segmentation and object proposal generation. IEEE Transactions on
Pattern Analysis and Machine Intelligence (TPAMI), 2016.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time
object detection. arXiv preprint arXiv:1506.02640, 2015.
[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection
with region proposal networks. In Advances in neural information processing systems,
pages 91–99, 2015.
[20] P. Salembier and L. Garrido. Binary partition tree as an efficient representation for
image processing, segmentation, and information retrieval. IEEE transactions on Image
Processing, 9(4):561–576, 2000.
[21] A. Salvador, X. Giró-i Nieto, F. Marqués, and S. Satoh. Faster r-cnn features for in-
stance search. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pages 9–16, 2016.
[22] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrit-
twieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go
with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. CoRR, abs/1409.1556, 2014.
[24] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout:
a simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15(1):1929–1958, 2014.
[25] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as
selective search for object recognition. In 2011 International Conference on Computer
Vision, pages 1879–1886. IEEE, 2011.
[26] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Ben-
gio. Show, attend and tell: Neural image caption generation with visual attention. In
International Conference on Machine Learning, pages 2048–2057, 2015.
[27] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection
from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 2678–2687, 2016.
[28] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. So Kweon. Attentionnet: Aggregating
weak directions for accurate object detection. In Proceedings of the IEEE International
Conference on Computer Vision, pages 2659–2667, 2015.
[29] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European
Conference on Computer Vision, pages 391–405. Springer, 2014.
Deep Learning for Image Processing Applications
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-177

Big Data & Deep Data: Minding the Challenges
M. Bhatia et al.
1. Introduction
Enormous amounts of information in the form of big data have become an essential topic for
every industry. The capacity to study and analyze substantial volumes of data makes it possible to
discover new trends and patterns in medicine, business, and everything in between.
Applying analytics to a business can lead to advances and discoveries
that would otherwise go unseen.
Big data is not a new word; it simply means taking the available data and exploring it in a
different way. Traditional business tools are unable to manage and process big
data because it refers to very large sets of data. Deep data, on the other hand, is the real tool
that we can use to transform the data.
Big data is a concoction of all the data collected by a business. The data can be
names of clients, contact information, routine operational data, and so on. Predictive analysis can be
applied to find patterns and trends and to abstract away the useless and redundant information.
The massive amount of data is pared down into useful sections of information,
excluding redundant information. Big data is a collection of massive and complex data sets that
are difficult to capture, analyze, store, and search using a traditional database
system. Figure 1 below shows the place of big data in various contexts.
Big data analytics is the use of advanced analytic techniques on very large, diverse data sets that
include different varieties, structured and unstructured, and different sizes, from
terabytes to zettabytes. The term big data was coined for data sets whose size or type
is beyond the ability of traditional relational databases to capture, manage, and
process with low latency. Big data has one or more of the following
characteristics: high volume, high velocity, or high variety. It comes from data
generated in huge amounts by sensors, devices, video/audio, networks, log files,
transactional applications, the web, and social media, produced in real time and at very
large scale.
Analyzing big data allows analysts, researchers, and business users to make better
and faster decisions using data that was previously inaccessible or unusable. Using
advanced analytics techniques such as text analytics, machine learning, predictive
analytics, data mining, statistics, and natural language processing, businesses can
analyze previously untapped data sources, independently or together with their existing
enterprise data, to gain new insights resulting in significantly better and faster decisions.
Figure 1. Big data in various contexts: customer satisfaction, a 360-degree view of the customer, a qualitative approach, predictions, healthcare, economic development, and fire services; 2.5 quintillion bytes of data are generated every day, with data growing 4x faster.
Big data and deep data are inherently similar: both use the mass of information that can
be collected on a daily basis. Industry can pair this data with analytics and use it
to help predict trends or changes. Useful trends can be found in this
mass of data using big data techniques, whereas deep data focuses on
specific information to predict trends.
For example, suppose we want to predict which clothing styles are going to sell best
during the next calendar year. We would not necessarily be looking at customer profiles,
especially for online selling. Deep data instead focuses on data such as sales numbers
and product information to make predictions. This is how deep data works.
Deep data analysis can be applied to medicine and many other fields as well. By focusing
on particular demographic features such as age, weight, gender, or race, deep data can
make the search for trial participants more streamlined and can increase the accuracy
of treatment or drug trials.
Deep learning is a rapidly growing field that models complex
patterns as multilayered networks. Many challenges in machine learning and artificial
intelligence are best solved by deep learning. Companies such as Microsoft and Google
use deep learning to find solutions in areas like image recognition, text analytics,
and speech recognition. The models constructed by deep learning require
substantial computing power.
The choice between big data and deep data for any business depends on the kind of
business being run and the type of data being collected. Deep data is the best
option when looking for specific trends or focusing on individual pieces of
information. Deep data allows you to eliminate useless or unessential pieces of data
while retaining the foremost information that will benefit the company.
Big data and deep data are a necessity, not an option; both are very useful
techniques for any type of business.
Customers once expected services and solutions that matched those of their neighbors, but now they expect
products and services that best fit their individual needs and interests. For example,
phone services and applications are no longer delivered as the same service to everyone;
they are expected to fit each person's lifestyle.
The presence of big data is increasing day by day, and every business is making itself
more and more equipped to handle the abundance of data it holds about its
customers. There is value embedded in this volume of data, but there is a need to add
meaningful value for the customer. Here comes deep data. To realize that
particular business value, big data must move beyond sheer volume; things need to change here.
Companies are moving towards a framework called the "Deep Data" framework: an
approach which says that data streams rich in information can retrieve more
value than a sheer mass of data.
Shifting the approach from big to deep makes businesses better able to understand
their clients. In many sectors, technology driven by deep data is helping to engage
consumers better. From the organizational perspective, companies must start to internalize and adopt a
data-centric view at every level of the organization.
Regardless of industry, deep data is poised to disrupt the way organizations engage
customers. Companies that embrace deep data models gain success in sales
and customer engagement, while those that are unable to incorporate a deep data model
and remain far away from analytics fail to meet customer satisfaction and consumer
demand.
There are things that need to change. From 2015, companies have moved towards a
deep data framework, which states that data streams rich in information can
retrieve more value than a mass of data. When a company shifts from big data towards deep
data, the business is able to better understand its customers and offer actionable,
scalable, and customized insights while crucially enhancing the value of its economic
investment in data.
In the energy sector, deep data-driven technology is helping utilities better engage
consumers and benefit from trusted customer relationships in the face of new energy
choices, like solar, microgrids, and retail energy providers. As consumers become
"prosumers" who create new energy options, often at their own sites, utilities must
find new, innovative ways to deliver value to customers who want the most cost-
effective, energy-efficient options. Figure 2 below shows the origin of deep data:
Figure 2. The origin of deep data: <1900 (statistics), 1960-2003 (data mining), 2006-2007 (Google and predictive analytics), 2012-2014 (big data and data science), 2015 (deep learning).
There is a need to access data accurately and in real time to produce positive
returns on decision-making. Salespeople need real-time data about customers
before their meetings, marketers need real-time data to create targeted
campaigns, and professionals need easily digestible data in order to make business
decisions in real time, especially when on the go.
The regularization of data can prove a dramatic change for a company. Below are the
steps to achieve that:
1. Data-driven decisions should become standard operating procedure (SOP) for end users and
for client processes.
2. Data needs to be abundant throughout your company, and the use of that data
should be simple and automated.
3. The tools used should demonstrate how they support embedded
analytics, mobile views of data, workflow integration, and enhanced
visualizations for the organization.
4. The data presented should have the capability to influence the end user to consume
it effectively. Only an engaged end user makes effective use
of the tools; otherwise they are a waste of the company's money and time.
6. In the workflow process, only the most abstracted form of information should be
distributed.
7. The user significance curve is of great importance. The company should provide
self-service data preparation, data integration, and data enrichment.
8. Overcoming challenges
Every organization can appreciate the promise of accessing data easily, but the
problem is finding the right tool. According to a recent study, almost seventy
percent of companies failed to use analytics tools properly for data regularization,
and seventy percent of companies are willing to increase their analytics budget
in order to use better analytics tools. Figure 3 shows some open challenges
of Big Data and Deep Data.
Figure 3. Open challenges of Big Data (data quality, data processing, data protection, data cleaning, tools, skill set) and of Deep Data (it is an art, not a science; a foolproof architecture is needed; features that explain decisions are needed; no support for reasoning; no support for common sense).
The adoption of advanced analytic tools lets companies obtain much better business
intelligence, which in turn increases transparency and aids fast decision making.
Table 2 below mentions some of the challenges and their solutions:

Challenge                                                   Solution
Predictive capabilities for faster decisions                Apps
Extract, transform, or load for data migration              CRM tools
Data enrichment, data visualization, or report creation     Self-service BI tools
10. Global relationships and data patterns are also retrieved by deep learning
Linear models can also be obtained from complex data representations and successfully applied
to text, video, and other media. Abstraction of raw data can lead to relational
and semantic data. Deep learning architectures are more suitable for addressing the issues
and problems of veracity, variety, volume, and velocity, as shown in the figure below:
Figure. Deep learning architectures and algorithms addressing the Big Data issues of veracity, velocity, volume, and variety.
Some algorithms have shallow learning hierarchies which are not capable of
understanding the complexity of patterns in large data. The most important feature of deep
learning is that it deals with data abstraction and representations, so it is best suited for
analyzing raw data in different formats and from different sources. While presenting
new challenges for more conventional data analysis approaches, Big Data
analytics presents an important opportunity for developing novel algorithms and
models to address specific issues related to Big Data.
Deep learning concepts provide one such solution venue for data analytics experts and
practitioners. For example, the representations extracted by deep learning can be
considered a practical source of knowledge for decision-making, semantic indexing,
information retrieval, and other purposes in Big Data analytics; in addition,
simple linear modeling techniques can be used for Big Data analytics when
complex data is represented in higher forms of abstraction.
In the remainder of this section, we summarize some important works in the field of
deep learning algorithms and architectures, including semantic indexing and social tagging.
Our aim is that, by presenting these works, experts can observe the novel applicability
of deep learning techniques in Big Data analytics, particularly since some of the
application domains in the works presented involve large-scale data. Deep learning
algorithms are applicable to different kinds of input data; in this section, however,
we focus on their application to image, textual, and audio data.
An enormous amount of domain-specific information is collected by public
and private organizations, and this information is used for solving problems in
marketing, technology, medical science, national intelligence, and fraud detection. The
process of acquiring data is complex and crucial because the data is unlabeled,
uncategorized, and immensely complex to handle and analyze. Deep learning
algorithms are, however, very useful in the analysis of such huge volumes of
unsupervised data. Handling data with deep learning algorithms continuously improves
the data and makes the tools very useful for big data analysis. Deep learning
is adapted to handling enormous amounts of data and is thus very capable of
addressing the volume factor. Deep learning algorithms address the variety factor by
analyzing raw data from a variety of formats and sources. Thus, deep learning offers unique
solutions to complex problems.
Facebook, Twitter, and many other social media sites, shopping portals, traffic
monitoring systems, and security systems yield information in the form of text, video, audio,
and images. The information extracted from these sources is not only high in volume
but also comes in different formats. Such data, therefore, cannot simply be stored as
raw bit strings.
It is deep learning that makes storage efficient and data retrieval easy. Before deep
learning, raw data was used directly for data indexing, but with the emergence of
deep learning algorithms, high-level abstract data representations are used for
semantic indexing. Semantic indexing based on deep learning representations makes
search engines work quickly and efficiently. The semantic indexing approach of
deep learning represents the data in a way that makes it a very useful source
for knowledge discovery and understanding. Figure 5 shows the process of semantic
indexing.
Figure 5. The process of semantic indexing: information extracted from social media (text, images) with high volume and variety is semantically indexed to support quick and efficient searching.
The quick and efficient searching of images, audio, and voices is a purpose of big
data, so as to increase their accessibility in a quicker and more efficient fashion.
Nonlinear features are extracted from a variety of data by applying complex
deep learning algorithms to big data, and this nonlinearity brings deep learning
closer to artificial intelligence.
In this way, data analysts benefit from the vast reserves of knowledge in the pool of
big data. At the same time, by enabling the application of simple linear analytics, deep
learning offers computational efficiency.
A huge number of users connect to the internet daily, and the rise of digital content has
been exponential. Images and videos are being uploaded from multiple sources, and
the image repositories are massive. Image search therefore needs to improve,
and the browsing and retrieval process should be light and fast.
There is a requirement for automated tagging of images and videos. With
the help of deep learning, these images, text, audio, and video can be represented in
complex form at a high level of abstraction. This representation
can then be used for tagging images, which is of great benefit for huge data. Figure 6 below
shows the process of semantic tagging.
Figure 6. The process of semantic tagging: a word or phrase is identified, lightly parsed, and the phrase in the document is semantically tagged.
Deep learning has capabilities that hold the potential to infer the unstated
emotions and events in a text. It can identify objects in photos, and it can also make
knowledgeable predictions about people's likely future behavior. All these features
make it a hot property in the fields of intelligence, sales, marketing, and advertising.
No wonder, then, that Facebook has set up an internal team to reap the benefits of deep
learning in its work.
Social media and many other platforms are generating data, and companies are mining that data
to extract meaning. With a deep learning approach, social targeting can be
more specific than the traditional style.
With deep learning, the analysis of a target audience is based on patterns collected
from followers; instead of analyzing only a percentage of people, the analysis is done
on the whole audience. The resulting deep learning models are more accurate compared to
others.
When should deep learning be adopted in a business organization? Figure 7 below shows
the situations where deep learning should be adopted:
Figure 7. Situations where deep learning should be adopted: high-volume data, the power of technology, time is money, results suggest the next step, the need to stay ahead of competitors, and when stagnancy and duplication cannot be afforded.
Some widely used tools for big data are listed below:

Hadoop: open-source software for storing large datasets on distributed clusters. Scaling
the infrastructure up and down is not an issue; data of any kind and size can be stored,
with high processing power, and Hadoop can handle virtually limitless concurrent tasks.
Cloudera: an enterprise solution that provides better access to data. It also provides
security for sensitive data.
MongoDB: an alternative to relational database management systems. MongoDB is a good option
for semi-structured, unstructured, and dynamic data.
Talend: an open-source tool. Maintenance of a database is a complex task for companies;
Talend allows you to maintain your own database.
OpenRefine: an open-source tool that helps in cleaning unstructured data.
multiBamSummary: a tool used for data integration.
PlotProfile: a tool used for data visualization.
PlotHeatmap: a tool used for data visualization.
Nowadays there is a need to assess information more meaningfully. Data
science helps analysts to assess information more meaningfully and to take
decisions more intelligently. In the decision-making process, big data along with data
science provides a much stronger platform. Information can now be measured in
massive amounts and with remarkable precision, which helps analysts find
better-tuned solutions and supports the decision-making process.
The buzz of big data is becoming popular in long-term thinking, decision making, and
beliefs. Big data is used as a management tool to gain insight by almost all
industry leaders.
The cost of buying, computing, and organizing data has decreased over the last few
years, and there has been a sudden rise in the adoption of big data tools. A variety of data
types and data sets exist, and organizations are still thinking through how to deal with
this change. Advanced technology devices yield enormous amounts of data, but
data science and big data help extract the most significant and useful information.
Figure 8 below shows concepts related to Big Data and Data Science.
Figure 8. Concepts related to Big Data (Apache Hadoop, Cloudera, R, Tableau, descriptive statistics, exploratory data analysis, business intelligence, MapReduce) and to Data Science (ANOVA, hypothesis testing, descriptive statistics, regression models, decision trees, clustering, text analytics).
A variety of machine learning and data mining algorithms are available for building an
analytical platform. Many algorithms exist to handle specific business-related problems,
and they have proved beneficial in applications ranging from reminding patients to take
their medicine to the manufacturing of automobiles.
Each algorithm has a unique purpose, and this unique use allows
managers and analysts to find more promising solutions to complex business
problems.
Supervised and unsupervised algorithms have different uses on an organization's
datasets, and in the same way clustering algorithms have different applications on
different data sets. Figure 9 below shows various machine learning algorithms used in
Big Data.
Figure 9. Machine learning algorithms used in Big Data: K-means, text analysis, association rules, time series analysis, linear regression, logistic regression, decision trees, and the naive Bayes classifier.
Deep learning has special computing power compared to other machine learning
techniques. Taken together, these four powers give deep learning the strength to generate more
accurate models than other methods while also reducing the time required for model generation.
A data set may contain a great variety of variables, and deep learning detects relationships
between them. For example, suppose that a medicine causes negative effects in
female infants but not in older children. A predictive model that incorporates the combined
effect of sex and age will perform much better than a model based on sex alone.
Traditional modeling methods can only find this by performing hypothesis
testing manually; deep learning has the power to detect such interactions automatically and
does not depend on the analyst.
Deep learning also works best with high-cardinality class memberships, that is,
variables with an immense number of discrete values. For example, a word belongs
to the large class of speech tokens, and a picture belongs to a large class
of images.
Deep learning is also well suited to handling untagged data. It has the
power to learn from untagged data such as articles, news, speech, and images, and can
help find useful patterns in such data. The power of
deep data is shown in figure 10.
Figure 10. The power of deep data: handling untagged data, ease of model building, accurate models in less time, and detecting interactions among variables.
The table below explains the limitations of Deep Data and Big Data:
Deep Data                                      Big Data
Requires thousands of data samples             Leads to a change in the existing approach
Expensive to train in terms of computation     Data flowing into the company becomes constant rather than periodic
Theoretical foundation is missing              Analysis of real-time data leads to privacy issues
Training method is not defined                 Unreasonable searches cannot be handled
Learning is not comprehensive
Deep Learning for Image Processing Applications
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-194

SF-CNNLS for Intra-Class Variation of Vehicle Type Recognition
S. Awang and N.M.A. Nik Azmi
1. Introduction
1 Corresponding Author: Suryanti Awang, suryanti@ump.edu.my.
A camera-based system is able to reduce cost because its hardware requirements are not as expensive
as those of a sensor-based system. However, a camera-based system requires intelligent techniques to
recognize vehicle types.
Visually recognizing vehicle types is computationally difficult. On a surveillance
camera, vehicles of the same class can appear in many variations, and some vehicles
from different classes share the same appearance. Environmental conditions can also
change the appearance of a vehicle. Each vehicle can have different body designs,
accessories, and painted decorations, and these variations make visual recognition difficult.
Moreover, when a vehicle class shares a similar appearance with another vehicle class, these
similarities make visual recognition even more difficult. The surface of a vehicle may
contain a variety of environmental reflections and specular lighting patterns, and harsh
sunlight produces hard-edged shadows on the road that can change the
apparent shape of a vehicle on the camera.
More variations and similarities among vehicle classes mean more difficulty in
recognizing vehicle types. Thus, the main difficulties for visual recognition of vehicles
are:
i. Vehicles from the same class but with a lot of variations (inter-class variation);
ii. Vehicles from different classes but sharing a similar appearance (intra-class
variation); and
iii. Various environmental conditions such as weather and light illumination.
This chapter discusses techniques to classify vehicles from an image obtained by a
surveillance camera. The second section of this chapter reviews the various
implementations of the Convolutional Neural Network (CNN) in vehicle recognition systems
in the existing related research. The third section explains the SF-CNNLS methodology
for vehicle recognition in detail. The next section describes the execution of the
experiments and discusses the obtained results. The last section
of this chapter concludes the research.
Figure 1. The convolution process and subsampling process in a stage of CNN [2]
This section discusses existing works that used related techniques to classify vehicles
from an image obtained by a surveillance camera into vehicle classes, for instance,
the car, van, and truck classes, in order to obtain accurate results for traffic monitoring,
management, and other related applications.
A CNN is a multi-layer feed-forward neural network that is specifically designed
to extract features from two-dimensional data such as an image. Usually, the
CNN layers are arranged in stages. The CNN is commonly used in image
recognition because it can directly process a one-channel (grayscale) image. If
an image has three channels, for example an RGB image, the CNN can process each
channel in parallel. A few intelligent applications that use CNNs are face detection by
Tivive and Bouzerdoum [4] and Chen et al. [5], vehicle type recognition by Dong et al. [1,
3] and Bautista et al. [6], and document analysis by Simard et al. [7]. The processes of
a CNN are shown in Figure 1.
Based on figure 1, the input is either an image or a feature map, depending on
which stage is deployed. A stage of a CNN usually consists of two processes: a
convolution process and a subsampling process. The convolution process is applied to
the input to extract features; its outcome is a set of
convolved images or feature maps Cx, where x denotes the hidden layer level. Secondly,
the subsampling process is applied to Cx; its outcome is a
set of feature maps Sx+1 which will be used as input for a further stage.
During the convolution process, a trainable filter is required for the convolution
operation, denoted by the asterisk symbol (*). Some filter techniques that can be
deployed are Independent Component Analysis (ICA), Sparse Coding, Sparse
Autoencoder, and Sparse Filtering. The outcome of the convolution operation is a set of
extracted features. The features are summed with a trainable bias to produce the
output Cx. The trainable bias is obtained through CNN back-propagation training.
During the subsampling process, a small patch of Cx of size 2 by 2 pixels is
extracted and subsampled to produce one pixel of Sx+1. The patch
extraction and subsampling are repeated until all patches of Cx have been
processed. Four operations are involved in the subsampling process: the patch is
summed to produce a scalar result, multiplied by a scalar weight wx+1,
summed with a scalar trainable bias bx+1, and finally the scalar result is passed through a
sigmoid activation function to produce a pixel of Sx+1.
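As a rough illustration of the two processes in a single CNN stage, the numpy sketch below implements a 'valid' convolution with one trainable filter plus bias, followed by 2x2 subsampling with a scalar weight, bias, and sigmoid. The filter sizes, parameter values, and function names are assumptions for illustration only, not the implementation used in the cited works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cnn_stage(image, kernel, conv_bias, w_sub, b_sub):
    """One CNN stage as described above: convolution + bias -> C_x, then 2x2 subsampling -> S_x+1."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    c = np.zeros((out_h, out_w))
    for i in range(out_h):                       # 'valid' convolution with a trainable filter
        for j in range(out_w):
            c[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + conv_bias
    sh, sw = out_h // 2, out_w // 2
    s = np.zeros((sh, sw))
    for i in range(sh):                          # subsampling: sum each 2x2 patch, scale, bias, sigmoid
        for j in range(sw):
            patch_sum = np.sum(c[2 * i:2 * i + 2, 2 * j:2 * j + 2])
            s[i, j] = sigmoid(w_sub * patch_sum + b_sub)
    return c, s

# Toy usage with random data (shapes and values are illustrative only).
rng = np.random.default_rng(0)
C1, S2 = cnn_stage(rng.standard_normal((28, 28)), rng.standard_normal((5, 5)), 0.1, 0.5, 0.0)
print(C1.shape, S2.shape)  # (24, 24) (12, 12)
```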
Figure 2 shows a conceptual example of a CNN that contains three channels of
hidden layers, with each channel having four hidden layers, and one output layer. The
three channels of hidden layers can be processed in parallel. The output layer of the
CNN is the implementation of a classifier such as a Feed-Forward Neural Network
(FFNN), Softmax Regression, a Support Vector Machine (SVM), etc. [17][18].
The hidden layers of the CNN are designed to extract features and to reduce the
dimensionality of the extracted features so that they are robust to geometric
variations. C1 and C3 are the convolved results, while S2 and S4 are the feature
maps obtained through the processes described above. At the end of the hidden
layers, before the extracted features are supplied to the output layer, the
extracted features are vectorized into a one-dimensional vector. The classifier in the
output layer uses this vector to train itself, and after the classifier is
trained the vector is used to perform the classification process. The outcome of the
output layer is the classification result, which describes which class the extracted
features represent.
Most of the implemented works are CNNs that typically accept a grayscale
image of fixed size as input, with the output being the set of images sorted according
to the existing categories [12]. Dong et al. [1] introduced a layer-skipping strategy into
the CNN that enables it to extract both low-level and high-level features of a vehicle.
The CNN consists of a hidden layer stage 1 and a hidden layer stage 2. Each stage
consists of five similar components: a convolution layer, an absolute
rectification layer, a contrast normalization layer, an average pooling layer, and a subsampling
layer. The output from hidden layer stage 1 is forwarded to two different locations:
hidden layer stage 2 and the average pooling with subsampling layer. The
final layer for both stages is a vectorization layer that vectorizes the two-dimensional data
into a fully connected vector. Both vectors are concatenated into a single vector and
then forwarded to a Softmax Regression classifier for the classification task. Prior to the
convolution layer at hidden layer stage 1, they add zero padding to the image in order
to make the vehicle features consistent. However, this distorts the
vehicle features, making it difficult to address the aforementioned
second issue. They used the BIT-Vehicle dataset [9] and achieved more than 95%
accuracy on the vehicle classes of car, van, truck, and bus; however, taxis were
classified as the car class.
Rong and Xia [11] also proposed a CNN extending the layer-skipping strategy
of Dong et al., using a Sparse Auto Encoder (SAE) as the trainable filter for the
convolution layer. This CNN framework consists of three stages of hidden layers:
layer one, layer two, and layer three. Each stage of the hidden layer consists
of a convolutional layer and a pooling layer. The output from each stage of the hidden layer
is forwarded to a Principal Component Analysis (PCA) layer. In the PCA layer, the
outputs from each hidden layer are vectorized and concatenated into a single one-
dimensional vector, which is then compressed with PCA to reduce its dimensionality.
Finally, the vector is fed to the classifier for classification. The vehicle dataset from
Caltech, which has two classes (car and van) and no taxi images, was used. They tested
the performance using three different classifiers, namely Softmax Regression, SVM,
and DBN; Softmax Regression surpassed the other two classifiers with an
accuracy of 84%.
Huo et al. [12] proposed a multi-task Region-based CNN (R-CNN) model with
one input layer, a convolution stage, and a multi-task stage. The convolution stage
contains several components with different matrices: four
maximum pooling layers and three convolution layers. Inside the multi-task stage, there
are one fully connected layer and three output layers to classify a vehicle into sedan, van,
bus, and truck. The multi-task aspect of this work is defined by the labels used in the
output layers, which classify the vehicle based on vehicle class, viewing angle
(front, side, and back), and daytime versus nighttime. With their proposed model,
83% accuracy is achieved for the classification task. Another recent
implementation of CNN in vehicle type classification was done by Wang et al., where the
CNN is used to detect vehicles from video images [13]. The vehicle images are trained with
two types of learning, namely a fully connected CNN and Extreme Learning Machine (ELM).
Vehicle features known as CNN features are extracted from the fully connected CNN,
while additional samples are used to train the ELM to obtain features
known as weak labels. These two types of features are classified using adaptive
clustering to classify the vehicle into the classes of compact car, mid-size car (van), and
heavy-duty vehicle. In this research, they considered the taxi as part of the compact car class. They
managed to achieve 85.56% accuracy based on the front and back angles of vehicle
images.
Therefore, based on these related works, we can conclude that most of the recent
works implement deep learning methods, especially CNNs. However,
none of them classifies vehicles involved in the intra-class variation issue as a
different class, for example, a taxi as the taxi class rather than the same class as a car.
The next subsections describe the stages of the proposed SF-CNNLS system based on
the overall SF-CNNLS shown in figure 3.
The vehicles are recorded using a surveillance camera for the BIT dataset and for a self-
obtained dataset known as the SPINT dataset. Recorded video frames containing
a frontal view of a vehicle are selected. From each selected video frame, a region
containing the vehicle is manually cropped to ensure the classification process is focused
on the vehicle. The cropping procedure is illustrated in figure 4. The size and
aspect ratio of a cropped image vary for every selected video frame, due to
variations in vehicle size. The cropped images are stored in the training and testing datasets.
3.2. Pre-Processing
There are five steps in the pre-processing procedure, based on a
combination of existing works [1][6][15][16]: resizing with a maintained
aspect ratio, converting from the RGB color space to grayscale, Histogram Equalization
(HE), normalizing to zero mean and unit variance, and Local Contrast Normalization
(LCN). A flowchart of the pre-processing procedure is shown in figure 5. According to
that flowchart, the first process is to fetch an input image from either the training or the testing
dataset. The fetched image is resized while maintaining its aspect ratio and then
converted from the RGB color space to grayscale. HE is then applied using the
OpenCV 2.4 built-in HE function; its purpose is to ensure the pixel
intensity values are uniformly distributed. The resulting image is normalized to
zero mean and unit variance. LCN is applied as the final process to eliminate unwanted
features such as light illumination and shading.
Resizing an image while maintaining its aspect ratio is implemented by resizing its
longest side to a specified length and then resizing its shortest side to the
specified length divided by the aspect ratio. The procedure is shown in figure 6. To demonstrate
the procedure, suppose an input image of size 220 by 322 pixels is to be
resized, for example, to 256 pixels while maintaining its aspect ratio. Based on the size
of the input image, the longest side is its height (322 pixels) and the shortest side
is its width (220 pixels). The longest side is resized to 256 pixels, whereas
the shortest side is resized to the lower boundary of 256 divided by the aspect ratio, which
results in 174 pixels. Thus, the new size of the image is reduced from 220 by 322 pixels
to 174 by 256 pixels without altering its aspect ratio. Maintaining the aspect ratio during
resizing ensures that the shape of the vehicle remains as consistent as possible.
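A small sketch of this aspect-ratio-preserving size calculation reproduces the worked example above; the function name and the use of integer flooring are assumptions for illustration.

```python
def resize_dims(width, height, target):
    """New (width, height) when the longest side is resized to `target`
    and the shortest side is scaled by the same aspect ratio (floored)."""
    if height >= width:
        new_h = target
        new_w = int(target * width / height)   # divide the target length by the aspect ratio
    else:
        new_w = target
        new_h = int(target * height / width)
    return new_w, new_h

print(resize_dims(220, 322, 256))  # (174, 256), matching the worked example above
```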
Normalizing the input image to zero mean and unit variance is done by subtracting
its mean and dividing by its standard deviation. The first step is to calculate the
mean and standard deviation of the input image using equations (1) and (2)
respectively. With the mean and standard deviation obtained from the calculation, the
second step is to subtract the mean from the input image and then divide it by the standard
deviation. The resulting image has zero mean and unit variance. This
normalization method is used in many machine learning tasks; without normalization, the
training process has difficulty achieving convergence.
\mu = \frac{1}{n(x)} \sum_{i}^{H} \sum_{j}^{W} x_{i,j}    (1)

\sigma = \sqrt{\frac{1}{n(x)} \sum_{i}^{H} \sum_{j}^{W} \left(x_{i,j} - \mu\right)^{2}}    (2)

where \mu is the mean, \sigma is the standard deviation, x_{i,j} is the pixel value at index i and j of the
input image, n(x) is the cardinality of the input pixels, and H and W are the height and width of the
input image respectively.
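A minimal numpy sketch of equations (1) and (2), assuming the whole image is normalized at once:

```python
import numpy as np

def normalize(image):
    """Zero-mean, unit-variance normalization (equations 1 and 2)."""
    mu = image.mean()      # equation (1)
    sigma = image.std()    # equation (2)
    return (image - mu) / sigma

x = np.random.default_rng(1).random((4, 4))
y = normalize(x)
print(round(y.mean(), 6), round(y.std(), 6))  # ~0.0 and ~1.0
```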
The last step in this pre-processing is applying LCN. The LCN consists of two
operations: subtractive normalization (equation 3) and divisive normalization
(equation 4).

z_{i,j} = x_{i,j} - \sum_{p=-4}^{4} \sum_{q=-4}^{4} w_{p,q}\, x_{i+p,\, j+q}    (3)

y_{i,j} = \frac{z_{i,j}}{\max\left(M,\, M(i,j)\right)}    (4)

M(i,j) = \sum_{p=-4}^{4} \sum_{q=-4}^{4} w_{p,q}\, z_{i+p,\, j+q}^{2}    (5)

M = 1    (6)

where x is the input image, w_{p,q} is a normalized Gaussian filter of size 9 x 9, and z is the
output of the subtractive normalization. The indices i and j denote the row and column of the
input image respectively. The subtractive normalization (equation 3)
is applied first to the input image, and divisive normalization (equation 4) is then applied to
the resulting image. Thus, the final output of this phase is a pre-processed image that
will be used to extract the vehicle features.
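The LCN step of equations (3)-(6) could be sketched as follows; the Gaussian width (sigma) and the use of scipy's convolve2d are assumptions for illustration, not the chapter's implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(size=9, sigma=2.0):
    """Normalized Gaussian filter w_{p,q} (the 9x9 window in equations 3-5)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def local_contrast_normalization(x, m=1.0):
    w = gaussian_kernel()
    z = x - convolve2d(x, w, mode='same')            # subtractive normalization, eq. (3)
    m_local = convolve2d(z ** 2, w, mode='same')     # weighted local energy M(i, j), eq. (5)
    return z / np.maximum(m, m_local)                # divisive normalization, eq. (4) with M = 1

img = np.random.default_rng(2).random((32, 32))
print(local_contrast_normalization(img).shape)       # (32, 32)
```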
In this process, the features of the pre-processed image are extracted using the
trained CNNLS framework. To explain how the CNNLS framework implements the
feature extraction, the components of the CNNLS framework are first explained based
on figure 4. The CNNLS requires two stages of optimized
Sparse Filters to extract local and global features from the pre-processed input image.
The hidden layers of stage 1 and stage 2 consist of five components: a
convolutional layer, an Absolute Value Rectification (AVR) layer, a Local Contrast
Normalization (LCN) layer, an average pooling layer, and a subsampling layer without zero
padding. The difference between the two hidden layers lies in the input:
in hidden layer stage 1 the input to the convolutional layer is the pre-processed
image, whereas for the convolutional layer in hidden layer stage 2 the
input is replaced with the local features extracted from hidden layer stage 1. Here, the
extracted local features are convolved with the optimized stage 2 Sparse Filter to extract
global features.
The next hidden layer is the post hidden layer. The components of the post hidden
layer are different for the stage 1 and stage 2 hidden layers. The post hidden layer for stage 1
contains an average pooling and a subsampling layer with zero padding, while for stage 2 it
contains only a subsampling layer with zero padding. Note that the process in each
component is similar to that of the related component in each layer, and the output
of each component is the input to the next component.
In the convolutional layer of the stage 1 hidden layer, the pre-processed image is
convolved with the optimized stage 1 Sparse Filter to extract local features. The
convolutional operation is demonstrated in figure 7. Equations 7 and 8 are used for the
convolution operation.
y_i = \mathrm{sig}\left(x \otimes s_i\right)    (7)

\mathrm{sig}(x) = \frac{1}{1 + \exp(-x)}    (8)

where x is the input image from the pre-processing phase, s_i are the optimized sparse filters,
y_i are the extracted features, and sig(.) is the sigmoid activation function.
The next component is the absolute value rectification (AVR) layer, which applies an
absolute value operation to the extracted features using equation 9, so that the output
contains only absolute-valued elements. It is inspired by the biological observation that human eyes
do not perceive images in negative values.

y_i = \left| x_i \right|    (9)
where the other symbols are similar to those in equations 3, 4, and 5, except that k is the number of
features, s3 is the number of filters, and M is the maximum value in that group (4 x 4 x s3).
In the average pooling layer, an average filter is convolved with the extracted
features to make them less sensitive to variations in the angle and size of a vehicle. Figure 9
shows an illustration of the average pooling process.
Next, the subsampling layer with zero padding is applied to the extracted
features to ensure that every feature has the same size and an aspect ratio of 1:1 prior to the
classification process by Softmax Regression. The procedure for subsampling with
zero padding is shown in figure 10.
To demonstrate the procedure, suppose that an input feature of size 158 by 240
pixels is reduced to 64 pixels. Firstly, the input feature is resized to 64 pixels with a
maintained aspect ratio. Secondly, the shortest side of the resized input feature is
padded with zero pixels so that the shortest side has the same length as the longest
side. Thus, the resized feature has an aspect ratio of 1:1. This avoids
inconsistency in the size of the extracted features, which would result in a fatal error during the
matrix multiplication with the Softmax Regression weights.
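A sketch of this resize-and-zero-pad step, assuming a crude nearest-neighbour resize (the chapter does not specify the interpolation method), reproduces the 158-by-240 example:

```python
import numpy as np

def pad_to_square(feature, target=64):
    """Resize the longest side to `target`, then zero-pad the shortest side to target x target."""
    h, w = feature.shape
    if h >= w:
        new_h, new_w = target, max(1, int(target * w / h))
    else:
        new_w, new_h = target, max(1, int(target * h / w))
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    resized = feature[rows][:, cols]       # crude nearest-neighbour resize
    out = np.zeros((target, target))
    out[:new_h, :new_w] = resized          # zero padding up to a 1:1 aspect ratio
    return out

feat = np.random.default_rng(3).random((240, 158))   # longest side 240, as in the worked example
print(pad_to_square(feat).shape)                      # (64, 64)
```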
The last component is the fully connected layer, where the extracted features from both
hidden layers and the post hidden layer are concatenated into a one-dimensional vector. Figure
11 demonstrates how the fully connected vector is formed. Suppose that the
feature from hidden layer stage 1 consists of the elements {1, 2, 3, ..., 9} in a 3 x 3 arrangement
and the feature from the stage 2 hidden layer consists of the elements {A, B, C, ..., I} in a 3 x 3
arrangement. Each feature is first vectorized into a one-dimensional vector and then
concatenated to form a single one-dimensional vector.
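As a toy illustration of this concatenation (with numeric stand-ins for {A, ..., I}):

```python
import numpy as np

stage1 = np.arange(1, 10).reshape(3, 3)    # plays the role of {1, 2, ..., 9}
stage2 = np.arange(10, 19).reshape(3, 3)   # plays the role of {A, B, ..., I}

fully_connected = np.concatenate([stage1.ravel(), stage2.ravel()])
print(fully_connected.shape)  # (18,) - a single one-dimensional feature vector
```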
In this section, we describe the Softmax Regression used as the classifier. It is a simple
classifier, which is sufficient when the extracted features are
discriminant [19]. The vehicle features extracted from the CNNLS are used by this
classifier to classify the vehicle. Classifying vehicles works by executing the Softmax
Regression hypothesis, h_\theta, in equation 14.
h_\theta = \begin{bmatrix} \Pr(y = 1 \mid x; \theta, b) \\ \Pr(y = 2 \mid x; \theta, b) \\ \vdots \\ \Pr(y = K \mid x; \theta, b) \end{bmatrix}
= \frac{1}{\sum_{j=1}^{K} \exp\left(\theta_j^{T} x + b_j\right)}
\begin{bmatrix} \exp\left(\theta_1^{T} x + b_1\right) \\ \exp\left(\theta_2^{T} x + b_2\right) \\ \vdots \\ \exp\left(\theta_K^{T} x + b_K\right) \end{bmatrix}    (14)
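A compact numpy sketch of the hypothesis in equation 14; the max-subtraction for numerical stability and the toy shapes are assumptions added for illustration.

```python
import numpy as np

def softmax_hypothesis(x, theta, b):
    """Class probabilities Pr(y = k | x; theta, b) from equation 14."""
    scores = theta.T @ x + b     # theta_k^T x + b_k for every class k
    scores -= scores.max()       # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy example with 3 classes and a 4-dimensional feature vector (illustrative values).
rng = np.random.default_rng(4)
theta = rng.standard_normal((4, 3))   # one column of weights per class
b = rng.standard_normal(3)
x = rng.standard_normal(4)
p = softmax_hypothesis(x, theta, b)
print(p, p.sum())                     # probabilities summing to 1
```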
This section describes the unsupervised training that is executed according to the path
shown in figure 4. Note that this unsupervised training is part of the training
methods of the CNNLS. The purpose of the unsupervised training is to generate two
stages of optimized sparse filters, while the purpose of the supervised training is to
generate the trained weights and biases that will be used by the Softmax Regression for
the vehicle classification task.
The unsupervised training first optimizes the stage 1 sparse filter and later
uses the optimized stage 1 sparse filter to train the stage 2 sparse filter. A set of pre-
processed input images is first delivered to the Sparse Filtering function to generate a set
of optimized stage 1 sparse filters. A CNN needs trainable filters to extract features,
inspired by the visual cortex in the human brain; in this research, we chose the
Sparse Filtering function because it is the fastest filtering function for
unsupervised training. After that, the set of pre-processed input images is delivered to the
CNNLS stage one hidden layer and convolved with the optimized stage 1 sparse filters.
The output from the CNNLS stage one hidden layer is used as input for the Sparse
Filtering function to generate a set of optimized stage 2 sparse filters. Thus, the
outcome of the unsupervised training is a set of optimized stage 1 sparse filters and a
set of optimized stage 2 sparse filters. The components of the Sparse Filtering function
are shown in figure 12.
Based on figure 12, the Sparse Filtering function has a pre-processing phase before the optimized sparse filters are produced. The pre-processing phase consists of two processes: splitting a set of input images (or a set of features) into small patches, and then applying zero mean and unit variance normalization to each patch. After the pre-processing phase, the sparse filter optimization function is executed, which initializes a set of unoptimized sparse filters and then applies optimization to generate a set of optimized sparse filters.
The splitting process in the pre-processing phase is illustrated in figure 13, where, given a set of t input images (or input features), each input image of size w × h is split into m small patches of size p × p. The stride size, denoted by s, is the step size for patch extraction.
The second process in the pre-processing is normalizing the input patches using zero mean and unit variance normalization. Normalization is done in two steps: the first is a normalization across the patches and the second is a local normalization of each patch. These normalization techniques are called global and local centering, respectively.
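The splitting and two-step normalization can be summarized with a small sketch. The following Python code is a minimal illustration of the procedure described above; the function names, patch size p, stride s and array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extract_patches(image, p, s):
    """Split a single w x h image into p x p patches with stride s."""
    h, w = image.shape
    patches = []
    for y in range(0, h - p + 1, s):
        for x in range(0, w - p + 1, s):
            patches.append(image[y:y + p, x:x + p].ravel())
    return np.array(patches)          # shape: (m, p*p)

def normalize_patches(patches, eps=1e-8):
    """Two-step zero-mean / unit-variance normalization:
    global centering across patches, then local centering per patch."""
    # Global: normalize each pixel position across all patches.
    patches = (patches - patches.mean(axis=0)) / (patches.std(axis=0) + eps)
    # Local: normalize within each individual patch.
    patches = (patches - patches.mean(axis=1, keepdims=True)) / \
              (patches.std(axis=1, keepdims=True) + eps)
    return patches
```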
Next, the sparse filter optimization process is executed, turning a set of unoptimized sparse filters into a set of optimized sparse filters. This process has a few steps, as shown in the flowchart depicted in figure 14, and requires several initializations: initializing the sparse filters with normally distributed random numbers, vectorizing the input patches into one-dimensional vectors, and configuring the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm. The BFGS implementation used is based on the GNU GSL package [13]. The Sparse Filter is optimized by minimizing its objective, equation 15, through the backpropagation method using the BFGS optimization algorithm. The optimization is executed iteratively with forward and backward passes for 200 iterations; based on our observation, Sparse Filtering optimization usually reaches acceptable convergence within 200 iterations. When the 200 iterations are reached, we normalize each sparse filter to the range [-1, 1] to avoid exponent overflow in the convolution layer of hidden layer stage 2. This normalization is applied while storing the optimized sparse filters.
$$\sum_{i=1}^{C} \left\| \hat{f}^{(i)} \right\|_1 = \sum_{i=1}^{C} \left\| \frac{\tilde{f}^{(i)}}{\left\| \tilde{f}^{(i)} \right\|_2} \right\|_1 \tag{15}$$

where C is the number of input patches and $\hat{f}^{(i)}$ is obtained from $\tilde{f}^{(i)} / \| \tilde{f}^{(i)} \|_2$, in which $f^{(i)}$ is the soft-absolute activation function given in equation 16.
$$f^{(i)} = \sqrt{\epsilon + \left(w^T x^{(i)}\right)^2} \tag{16}$$

where $\epsilon$ is $10^{-8}$, $w$ is the weight known as the Sparse Filter and $x^{(i)}$ is a vectorized input patch.
During the forward pass, the Sparse Filtering objective (equation 15) is calculated. To evaluate it, $f^{(i)}$ is first computed using the soft-absolute function (equation 16). Then each feature is normalized across all examples by its L2-norm, $\tilde{f}_j = f_j / \|f_j\|_2$, where $f_j^{(i)}$ represents the $j$th feature value (rows) for the $i$th example (columns). The $\tilde{f}_j$ are then normalized per example, $\hat{f}^{(i)} = \tilde{f}^{(i)} / \|\tilde{f}^{(i)}\|_2$. Finally, the absolute values of all entries in $\hat{f}^{(i)}$ are summed up, $\|\hat{f}^{(i)}\|_1$. During the backward pass, the forward-pass procedure is differentiated through the backpropagation method to find the gradient needed to update the parameter $w$ of equation 16.
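As a rough illustration of equations 15 and 16 and the optimization loop described above, the following Python sketch computes the Sparse Filtering objective and minimizes it with a quasi-Newton optimizer. The chapter uses the BFGS routine from GNU GSL [13]; here SciPy's L-BFGS-B with numerical gradients stands in for it, and all names, shapes and initial scales are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def sparse_filtering_objective(w_flat, X, n_filters, eps=1e-8):
    """Sparse Filtering objective: soft-absolute activation (eq. 16),
    L2 normalization per feature (rows), L2 normalization per example
    (columns), then the L1 sum of the normalized features (eq. 15)."""
    W = w_flat.reshape(n_filters, X.shape[0])
    F = np.sqrt(eps + (W @ X) ** 2)                        # soft-absolute
    Ft = F / np.linalg.norm(F, axis=1, keepdims=True)      # normalize each feature
    Fh = Ft / np.linalg.norm(Ft, axis=0, keepdims=True)    # normalize each example
    return np.abs(Fh).sum()

def train_sparse_filter(X, n_filters, n_iter=200):
    """Optimize the filters; L-BFGS-B with finite-difference gradients
    stands in for the GSL BFGS routine used in the chapter."""
    w0 = np.random.randn(n_filters * X.shape[0]) * 0.01
    res = minimize(sparse_filtering_objective, w0, args=(X, n_filters),
                   method='L-BFGS-B', options={'maxiter': n_iter})
    W = res.x.reshape(n_filters, X.shape[0])
    # Rescale each filter into [-1, 1] before storing, as described in the text.
    return W / np.abs(W).max(axis=1, keepdims=True)
```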
Figure 15 shows the optimized sparse filters for stage 1. For this research, we obtained 81 optimized sparse filters of size 9×9 each. Figure 16 shows the optimized sparse filters for stage 2, which are convolved from the sparse filters in figure 15; it consists of 9 sparse filters of size 9×9 each.
The path for supervised training is shown in figure 3, and all the components executed in this training have been explained earlier except the training of the Softmax Regression. Here we explain how the Softmax Regression is implemented in the supervised training. The Softmax Regression is trained to produce trained weights and biases. The training is done by minimizing both the negative log-likelihood equation (equation 17) and the Mean Squared Error (MSE) equation (equation 18) using the gradient descent method. Minimizing the negative log-likelihood equation is equivalent to minimizing the Kullback–Leibler divergence, which is used by Dong et al. [1]. The optimization is executed for 10000 iterations to ensure the weight $\theta_k$ and bias $b_k$ achieve convergence. Equations 19 and 20 are the derivative equations for updating the weight and bias.
$$J(\theta) = -\frac{1}{M}\left[\sum_{i=1}^{M}\sum_{k=1}^{K} 1\{y_i = k\}\,\log \Pr\left(y_i = k \mid x_i;\theta,b\right)\right] + \frac{\lambda}{2}\left\|\theta\right\|_2^2 \tag{17}$$

$$MSE = \frac{1}{K \cdot M}\sum_{k=1}^{K}\sum_{i=1}^{M}\left[\Pr\left(y_i = k \mid x_i;\theta,b\right) - 1\{y_i = k\}\right]^2 \tag{18}$$

$$\nabla_{\theta_k} J(\theta) = -\frac{1}{M}\sum_{i=1}^{M}\left[x_i\left(1\{y_i = k\} - \Pr\left(y_i = k \mid x_i;\theta,b\right)\right)\right] + \lambda\theta_k \tag{19}$$

$$\nabla_{b_k} J(\theta) = -\frac{1}{M}\sum_{i=1}^{M}\left[1\{y_i = k\} - \Pr\left(y_i = k \mid x_i;\theta,b\right)\right] \tag{20}$$

$$\Pr\left(y_i = k \mid x_i;\theta,b\right) = \frac{\exp\left(\theta^{(k)T} x^{(i)} + b^{(k)}\right)}{\sum_{j=1}^{K}\exp\left(\theta^{(j)T} x^{(i)} + b^{(j)}\right)} \tag{21}$$
where K is the number of vehicle classes and M is the number of training features from all K vehicle classes. $x \in \Re^d$ is the vehicle feature vector extracted from CNNLS, with d the number of elements in the vector x. $1\{y_i = k\}$ is the indicator function, which returns 1 if $y_i$ equals k and 0 otherwise. $\theta_k$, $k = 1, 2, \ldots, K$, is the weight and $b_k$, $k = 1, 2, \ldots, K$, is the bias for each vehicle class, $y$ is the actual label, and $\lambda$ is a non-negative regularization parameter.
The derivatives in equations 19 and 20 are calculated to minimize the output of the negative log-likelihood (equation 17) at each iteration; the aim is to iteratively update the weight $\theta_k$ and bias $b_k$ in equation 17. The MSE equation is evaluated at the end of each iteration to monitor the supervised training progress; usually, after 10000 iterations the MSE reaches an acceptable error of about 0.2%. Note that the Softmax Regression used in this research contains a non-negative regularization parameter, denoted as $\lambda$. The purpose of the regularization parameter is to control the generalization performance of the Softmax Regression: the larger the $\lambda$ value, the greater the amount of generalization. The $\lambda$ value needs to be chosen such that the classification of the Softmax Regression is neither too general nor too specific to the training dataset. A small amount of generalization can improve the classification accuracy because the classification will not become overly dependent on the training dataset. Usually, the value of $\lambda$ is chosen somewhere between 0.20 and 1.00; setting $\lambda$ to zero disables the regularization.
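A compact sketch of the supervised training loop described by equations 17–21 is given below. It uses plain gradient descent on the regularized negative log-likelihood and monitors the MSE at each iteration; the learning rate and the default λ are illustrative assumptions, not values reported by the chapter.

```python
import numpy as np

def softmax_prob(X, theta, b):
    """Equation 21: class probabilities for a feature matrix X of shape (M, d)."""
    scores = X @ theta.T + b
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, K, lam=0.5, lr=0.1, n_iter=10000):
    """Gradient-descent training of Softmax Regression (equations 17-20).
    X: (M, d) CNNLS features, y: integer labels in [0, K)."""
    M, d = X.shape
    theta, b = np.zeros((K, d)), np.zeros(K)
    Y = np.eye(K)[y]                               # one-hot indicator 1{y_i = k}
    mse = None
    for _ in range(n_iter):
        P = softmax_prob(X, theta, b)
        grad_theta = -(Y - P).T @ X / M + lam * theta   # eq. 19
        grad_b = -(Y - P).mean(axis=0)                  # eq. 20
        theta -= lr * grad_theta
        b -= lr * grad_b
        mse = ((P - Y) ** 2).sum() / (K * M)            # eq. 18, monitored per iteration
    return theta, b, mse
```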
3.7. Testing
In this section, the procedure for testing the trained CNNLS is discussed. Recall that the purpose of this testing is to evaluate the performance of the CNNLS including the Softmax Regression classifier. According to the testing path in figure 3, the testing flow is identical to the supervised training path, except that the input images are fetched from the testing dataset. The trained weights and biases for the Softmax Regression are loaded from the supervised training into the testing, and the hypothesis equation (equation 14) is calculated instead of minimizing the negative log-likelihood equation (equation 17).
To demonstrate the testing procedure, suppose that the Softmax Regression is trained to recognize three vehicle classes: car, taxi, and truck. Testing is performed on each vehicle class dataset, and the results are reported as true-positives and false-positives. The true-positive rate, reported as the accuracy performance, covers the vehicles whose type is correctly recognized as the actual class. The false-positive rate covers the vehicles that are recognized as a class other than their actual class. The true-positive and false-positive rates are calculated using equations 22 and 23, respectively, for every vehicle class testing dataset. The recognition result for every vehicle class is displayed in a confusion matrix.
$$TruePositive = \frac{n}{N} \times 100\% \tag{22}$$

$$FalsePositive = \frac{u}{N} \times 100\% \tag{23}$$
In this section, we test the SF-CNNLS methodology and evaluate its performance. We used a benchmark database, BIT-Vehicle [1], and a self-obtained database named SPINT-Vehicle. Both databases contain vehicle types that are commonly used. Table 1 below shows the distribution of the training and testing datasets for each vehicle class in each database. Note that, for training, the number of samples for each class should preferably be balanced to avoid any bias during training. Thus, we used 40 samples per class as the training dataset, whereas, for the testing set, we used various numbers of samples for each class in each database. The taxi class has fewer samples in the BIT-Vehicle dataset, whereas the truck class has fewer samples in the SPINT-Vehicle dataset. Regarding the capture conditions, both BIT and SPINT images were taken from the frontal angle of a surveillance camera during daylight, with slight illumination variation and shadow. In SPINT, we obtained the vehicle data during
[Table residue: rows for the Car, Taxi and Truck classes of the BIT and SPINT databases; the numeric entries are not recoverable here.]
Based on that setting, we split our experiments into two main experiments: the first is to obtain the True Positive for each class in each database and the second is to obtain the False Positive. Thus, there are 3 experiments for the True Positive and 6 experiments for the latter. Table 3 shows the True Positive for each class as well as the average performance, while table 4 shows the False Positive represented as a confusion matrix.
The results in tables 3 and 4 show that our method is able to classify the vehicle classes, in particular recognizing the taxi as a class distinct from the car, with 89% and 75% true-positive rates in the respective datasets. However, 11% and 19% of the taxi samples were misclassified as cars, which is the highest false-positive rate among the classes. For the car, 92.5% is correctly classified, with 4% misclassified as taxi and 3.5% as truck, in the BIT dataset, whereas 98.5% of cars are correctly classified, with 6% misclassified as taxi and none classified as truck, in the SPINT dataset. Overall, the SPINT dataset produced better results than BIT because the illumination variation in SPINT is smaller. In terms of processing time, our methodology is able to recognize the testing samples in 10 minutes for the BIT dataset and 10 minutes 15 seconds for the SPINT dataset.
The above results were obtained when testing with cars without sunroofs; in that setting, the Softmax Regression is able to differentiate taxi and car as separate classes. However, when cars with sunroofs are included in the car class, the results drop significantly even though the number of cars with sunroofs is limited. This is because the features of a sunroof are almost identical to the taxi sign on top of the taxi roof, so the Softmax Regression is unable to discriminate the features effectively. Tables 5 and 6 show the results obtained when cars with sunroofs are considered.
5. Conclusion
Vehicle type recognition systems can be sensor-based or camera-based. The system used in this research is camera-based and requires an artificial intelligence technique to classify the vehicle types. CNN, one of the deep learning approaches, is chosen for this research to classify the vehicle types. Specifically, we used a modified CNN with a Layer Skipping (CNNLS) strategy. Some changes were made, such as in the image pre-processing, to maintain the original features; thus, the low-level and high-level vehicle features can be extracted effectively without the distraction of zero padding during the convolution layer. We also modified the subsampling layer with zero padding at the end of the post hidden layer for hidden layer stage 1 to avoid a matrix multiplication error during the classification process in the Softmax Regression. The results show promising performance when we tested our methodology with the standard car, taxi, and truck types. We used a benchmark dataset that consists of all the mentioned vehicle classes, called BIT, and a self-obtained dataset with similar vehicle classes, called SPINT. The datasets consist of frontal views of vehicles obtained from a mounted surveillance camera in daylight. The difference between these two datasets is that SPINT has less illumination variation than BIT because the data were gathered during different periods of time. We used 40 samples for each vehicle class in each dataset as training samples and various numbers of samples per vehicle class for testing. The experiments show that our methodology is able to correctly classify the taxi as a class separate from the car, whereas the related works classify the taxi as the same class as the car due to the almost identical features that taxi and car share. To conclude, the accuracy of vehicle type recognition can be enhanced if more discriminant features are extracted and trained.
References
[1] Z. Dong, M. Pei, Y. He, T. Liu, Y. Dong and Y. Jia, Vehicle type classification using unsupervised
convolutional neural network. 22nd IEEE International Conference on Pattern Recognition (ICPR)
(2014), 172–177.
[2] I. Arel, D.C. Rose and T.P. Karnowski, Deep machine learning - a new frontier in artificial intelligence
research. IEEE Computational Intelligence Magazine, 5(4) (2010), 13–18.
[3] Z. Dong, Y. Wu, M. Pei and Y. Jia, Vehicle type classification using a semisupervised convolutional
neural network. IEEE Transactions on Intelligent Transportation Systems, 16(4) (2015). 2247–2256.
[4] F.H.C. Tivive and A. Bouzerdoum, A new class of convolutional neural networks (siconnets) and their
application of face detection. Proceedings of the International Joint Conference on Neural Networks,
vol 3. (2003), 2157–2162.
[5] Y.N. Chen, C.C. Han, C.T. Wang, B.S. Jeng and K.C. Fan, The application of a convolution neural
network on face and license plate detection. 18th International Conference on Pattern Recognition
(ICPR’06), vol 3, (2006), 552–555.
[6] C.M. Bautista, C.A. Dy, M.I. Manalac, R.A. Orbe and M. Cordel, Convolutional neural network for
vehicle detection in low resolution traffic videos. IEEE Region 10 Symposium (TENSYMP) (2016),
277–281.
[7] P.Y. Simard, D. Steinkraus and J.C. Platt. Best practices for convolutional neural networks applied to
visual document analysis. ICDAR, vol. 3 (2003), 958–962.
[8] J. Ngiam, Z. Chen, S.A. Bhaskar, P.W. Koh and A.Y. Ng, Sparse filtering, Advances in neural
information processing systems, (2011), pages 1125–1133.
[9] Z. Dong, Y. Wu, M. Pei and Y. Jia, Vehicle type classification using a semisupervised convolutional
neural network, IEEE Transactions on Intelligent Transportation Systems, (2015), in press.
[10] H.T. Phan, A.T. Duong and S. T. Tran, Hierarchical sparse autoencoder using linear regression-based
features in clustering for handwritten digit recognition. 8th International Symposium on Image and
Signal Processing and Analysis (ISPA) (2013), pages 183–188.
[11] H. Rong and Y. Xia, A vehicle type recognition method based on sparse auto encoder, Proceedings of
the International Conference on Computer Information Systems and Industrial Applications, 2015,
DOI: 10.2991/cisia-15.2015.88
[12] Z. Huo, Y. Xia and B. Zhang, Vehicle type classification and attribute prediction using multi-task
RCNN, 9th International Congress on Image and Signal Processing, BioMedical Engineering and
Informatics (CISP-BMEI) (2016), pp. 564-569.
[13] B. Gough, GNU Scientific Library Reference Manual - Third Edition, Network Theory Ltd., 2009.
[14] S. Wang, F. Liu, Z. Gan and Z. Cui, Vehicle type classification via adaptive feature clustering for traffic
surveillance video, 8th International Conference on Wireless Communications & Signal Processing
(WCSP) (2016), Yangzhou, pp. 1-5.
[15] N. Pinto, D.D. Cox and J.J. DiCarlo, Why is real-world visual object recognition hard?, PLOS
Computational Biology, 4(1) (2008), 1-6.
[16] Y. Tang, C. Zhang, R. Gu, P. Li and B. Yang, Vehicle detection and recognition for intelligent traffic
surveillance system. Multimed Tools Appl., 243 (2015).
[17] A. Chantakamo and M. Ketcham, The multi vehicle recognition using hybrid blob analysis and feature-
based. Proceedings 7th International Conference on Information Technology and Electrical
Engineering: Envisioning the Trend of Computer, Information and Engineering, ICITEE. Chiang Mai,
Thailand; (2015) 453-457.
[18] N. Najva and K.E. Bijoy, SIFT and Tensor Based Object Detection and Classification in Videos Using
Deep Neural Networks, Procedia Computer Science, 6th International Conference On Advances In
Computing & Communications, ICACC, 6-8. vol 93. Cochin, India, (2016) 351-358.
[19] S. Awang, R. Yusof, M. F. Zamzuri and R. Arfa, Feature Level Fusion of Face and Signature Using a
Modified Feature Selection Technique, International Conference on Signal-Image Technology &
Internet-Based Systems, (2013), pp. 706-713.
Deep Learning for Image Processing Applications
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-218
On the Prospects of Using Deep Learning for Surveillance and Security Applications
S. Liu et al.
1. Introduction
Subsequently, engineered feature extraction and classification are performed for each candidate region.
The target locations estimated by the ATD are given as input to subsequent algorithms, such as object recognition, for high-level analysis. Face recognition and face verification, examples of object recognition, are important components of visual surveillance. In face verification, two face images are compared to verify whether they belong to the same person. The general face verification pipeline is as follows: face patches are first detected and aligned, then high-dimensional over-complete visual features are extracted from each aligned face patch, after which they are fed into similarity metrics or binary classifiers to obtain the verification result.
In the last few years, deep learning algorithms have received significant attention,
especially the Convolutional Neural Networks (CNNs). Many researchers have started
to employ the CNN on their ATD and face recognition applications. In this chapter, we
present two such CNN-based applications, namely, automatic military target detection
and deep face recognition. We illustrate the advances made by deep learning techniques
in automatic visual surveillance and security systems. In the first CNN-based applica-
tion, we present a state-of-the-art military target detector which integrates a novel im-
age fusion method and the CNN. In the second CNN-based application, we present a
CNN-based face recognition system, where we incorporate the recent advances in deep
learning.
The chapter is organized as follows: Section 2 introduces the proposed military automatic target detection application in detail, and Section 3 illustrates the face recognition application. For each application, we describe its background, detailed method, experiments and summary in several subsections. Finally, we conclude both applications in
Section 4.
2.1. Background
Automatic target detection (ATD) is key to automatic military operations and surveil-
lance missions. In a military mission, sensors can be placed on the ground or mounted on
unmanned aerial vehicles(UAVs) and unmanned ground vehicles (UGVs) to acquire in-
formation. The acquired sensory data will be processed using an ATD algorithm, which
aims to find bounding boxes to localize the target. Fast and accurate object detectors can
increase the survivability of the weapon platform and soldiers. The automatic identification of threats in a military scenario is therefore essential.
In recent years, several ATD algorithms have been proposed. Generally, these algo-
rithms can be classified into two main categories: 1) background modeling approaches,
2) and learning-based approaches.
Background modeling approaches assume that background pixels have a similar
color (or intensity) over time in a fixed camera, and the background model is built on
this assumption. The background is subtracted from the input image, and the foreground (moving objects) region is determined by marking the pixels in which a significant difference occurs. In [39], the authors modeled the background using a kernel density estimation (KDE) method over a joint domain-range representation of image pixels.
Figure 1. Left: the appearance of the target is indistinguishable from the background environment. Right: the scale of the target varies dramatically.
Several image fusion based methods have been proposed in the literature to enhance target representation [3] [31] [17] [42]. Multiple images acquired over different ranges of the electromagnetic spectrum were fused into one image by pixel-level image fusion algorithms, such as principal components analysis (PCA) based weighted fusion [42] and the discrete wavelet transform (DWT) [31], and then fed into an ATD system. However, traditionally fused images offer limited performance for ATD tasks. To address this limitation, we propose a novel image fusion approach to improve detector performance in the military scenario, which exploits the significant advantage of the unsupervised feature learning characteristic of CNNs. Compared with high-level image fusion, the proposed method can achieve higher accuracy and computational efficiency. In addition, we adapted a state-of-the-art generic object detection framework to the military scenario and used cross-domain transfer learning techniques to compensate for insufficient data. In this way, the proposed framework achieves promising results on the military sensing information analysis center (SENSIAC) dataset.
2.2. Methods
Figure 2. The pipeline of the proposed object detector framework, which includes four main components: 1) image fusion, 2) CNN feature extractor, 3) ROI proposal, and 4) ROI classification and regression.
In this section, we elaborate the overall proposed ATD system. As shown in Figure 2, the whole system is composed of four modules: 1) an image fusion module, which fuses three different types of images into a BGR image; 2) a CNN feature extractor, used for extracting high-level semantic representations from the fused image; 3) a region of interest (ROI) proposal module, operating on the fused image, which generates hundreds or thousands of candidate bounding boxes, each corresponding to an ROI on the feature map produced by the feature extractor; and 4) an ROI classification and regression module, which produces fine bounding boxes and the corresponding class.
We investigated two types of images from different sensors: the mid-wave infrared image (MWIR) and the visible image (VI). In addition to the images acquired from these two sensors, we also incorporate a motion image generated from two consecutive visible frames in order to provide a complementary description of the objects.
MWIR: Depending on the different range of electromagnetic spectrum, the infrared
(IR) spectrum can be divided into different spectral bands. Basically, the IR bands in-
clude the active IR band and the passive (thermal) IR band. The main difference between
active and passive infrared bands is that the passive IR image can be acquired without
any external light source. The passive (thermal) IR band is further divided into the mid-
wave infrared (3-5 um) and the long-wave infrared (7-14 um). In general, the mid-wave
infrared (MWIR) cameras can sense temperature variations over targets and background
at a long distance, and produce thermograms in the form of 2D images. Its ability to
present large contrasts between cool and hot surfaces is extremely useful for many com-
puter vision tasks such as image segmentation and object detection. However, the MWIR sensor is not sensitive to a cool background. Moreover, the textures corresponding to the high frequencies of the objects are mostly missed, because of the possible absence of auto-focus lens capabilities and low-resolution sensor arrays.
VI: The electromagnetic spectrum of the visible image ranges from 380 nm to 750 nm. This type of image can be easily and conveniently acquired via various kinds of general cameras. In comparison with the MWIR image, the VI image is sensitive to illumination changes, preserves high-frequency information and can provide a relatively clear perspective of the environment. In most computer vision topics, the VI image has been the major focus of interest for many decades, so there are a large number of public VI datasets across many research areas. On the other hand, the significant drawbacks of the VI image are its poor quality in harsh environmental conditions with unfavourable lighting and pronounced shadows, and the lack of dramatic contrast between background and foreground when the environment is extremely complicated, such as in the battlefield.
Figure 3. The procedure of motion estimation, where t is the current frame, t-5 is the frame five steps earlier, and the motion is what our algorithm estimates.
Motion Image: In general, the moving objects are the targets in the battle fields.
Therefore, estimating the motion of objects can provide significant cues to segment those
targets. Various motion estimation algorithms have been proposed in recent decades,
such as dense optical flow methods, points correspondence methods, and background
subtraction. And each of them has shown effectiveness on many computer vision tasks.
However, considering the trade-off between accuracy and computational complexity, we do not opt for any of the complicated motion estimation approaches but utilize a straightforward and easily implemented method. The method is illustrated in Figure 3,
where we estimate the motion map based on two sampled frames. To be specific, the frames are sampled every 5 frames, and the previously sampled frame is subtracted from the current frame; the resulting image is the desired motion image. The method can be formulated as follows:
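$$M_n(x, y) = I_n(x, y) - I_{n-5}(x, y)$$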
where $M_n(x, y)$ represents the motion value of frame n at pixel (x, y) and $I_n(x, y)$ denotes the pixel value of frame n at pixel (x, y).
In this way, we do not need to consider multiple frames to estimate the background, as background subtraction methods do, and only a subtraction operator is employed, which is more efficient than other state-of-the-art methods. Even though this method can introduce a lot of noise into the motion image, the noisy motion image with complementary information is still useful for subsequent fusion.
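A minimal sketch of this frame-differencing step is shown below, assuming the frames are available as a list of 2D numpy arrays; taking the absolute difference is our assumption, since the text only mentions subtraction.

```python
import numpy as np

def motion_image(frames, n, step=5):
    """Estimate the motion image for frame n by subtracting the frame sampled
    `step` frames earlier. `frames` is a hypothetical list of grayscale arrays."""
    current = frames[n].astype(np.int16)
    previous = frames[n - step].astype(np.int16)
    # Absolute difference (assumption) clipped back to 8-bit range.
    return np.abs(current - previous).astype(np.uint8)
```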
Fusion Methodology:
Figure 4. Illustration of different information fusion architectures: (a) pixel-level fusion architecture; (b) fea-
ture-level fusion architecture; (c) decision-level fusion architecture.
Here, we formalize the possible configurations of information fusion for object de-
tection into three categories, namely, pixel-level fusion architecture, feature-level fusion
architecture and decision-level fusion architecture. An illustration is shown in Figure 4.
Having these possibilities in mind will help to highlight the important benefits of our
proposed fusion method in terms of efficiency and effectiveness.
Pixel-level Fusion: A typical pixel-level fusion architecture is illustrated in Figure 4(a). It deals directly with the pixels obtained from the sensors and aims at visual enhancement. Typically, multiple images from different sources are combined into one single image in a pixel-wise manner, after which it is fed into the object detection system to generate the final results. One of the main advantages of pixel-level fusion methods is their low computational complexity and easy implementation.
Feature-level Fusion: As a higher-level fusion system, one might pursue the feature-level fusion shown in Figure 4(b), in which different types of images are simultaneously fed into their own independent lower parts of the object detection system, typically called feature extractors. For instance, this lower-level system might be the hand-engineered feature extractor of a traditional object detection system or the high-level convolution layers of a CNN-based system. The concatenated features produced by the various independent lower systems are then fed into one upper decision-making
system to produce the final results. Although feature-level fusion is usually robust to noise, it is computationally expensive to run the feature fusion procedure in a parallel fashion, especially for CNN-based methods.
Decision-level Fusion: The decision-level fusion scheme, illustrated in Figure 4(c), operates at the highest level and refers to fusing the decision results from different systems designed for the various image types. Note that, for an object detection system based on learning algorithms, decision-level fusion does not capture the intrinsic relationship between the different images. In addition, this method can be difficult to implement, apart from being computationally expensive.
In our framework, we propose a novel image fusion approach. As shown in the image fusion module in Figure 2, the three types of raw images (MWIR, VI and motion image) are first concatenated into a BGR-style three-channel image, where MWIR occupies the red channel, the motion image the green channel and VI the blue channel. After the fused image is obtained, we feed it into a convolutional neural network (CNN) to train our object detector in an end-to-end manner. Meanwhile, the features from the different source images are fused together in the last feature maps of the CNN in an unsupervised learning fashion. Therefore, compared with feature-level and decision-level fusion methods, our approach is easier to implement and has lower computational complexity. Compared with pixel-level fusion, we employ an unsupervised learning style to fuse images from different sources instead of utilizing hand-engineered pixel-level fusion algorithms such as discrete wavelet transform (DWT) based methodologies.
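The channel-stacking step of the proposed fusion can be sketched in a few lines of Python; the function name and the assumption that all three inputs are single-channel images of the same size are ours.

```python
import numpy as np

def fuse_to_bgr(mwir, motion, visible):
    """Stack the three single-channel images into a BGR-style image:
    VI in the blue channel, motion in the green channel, MWIR in the red channel."""
    assert mwir.shape == motion.shape == visible.shape
    return np.dstack([visible, motion, mwir]).astype(np.uint8)
```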
Table 1. Network configuration: The complete network architecture contains two modules, where the first
module is called CNN feature extractor which includes 5 convolutional layers (conv 1-5), while the second
module is the ROI classification and regression which has a ROI pooling layer and 4 fully connected layers.
Name Conv1 Norm1 Pool1 Conv2 Norm2 Pool2 Conv3 Conv4 Conv5 ROI Pooling FC6 FC7 Cls Bbox
Input Channels 3 96 96 96 256 256 256 512 512 512 36 4096 1024 1024
Output Channels 96 96 96 256 256 256 512 512 512 36 4096 1024 2 8
Kernel Size 7×7 3×3 5×5 3×3 3×3 3×3 3×3 6×6
Type conv LRN max-pool conv LRN max-pool conv conv conv ROI-pool fc fc fc fc
Stride 2 2 2 2 1 1 1
Pad 1 1 1 1
Activation function relu relu relu relu relu
Dropout
We propose to address this issue by transferring CNN parameters trained on large-scale public visible image datasets, such as ImageNet [35], to the fused dataset. The VGGM model is pre-trained on the large-scale RGB image dataset ImageNet, which contains common objects and scenes observed in daily life. These pre-trained weights are then used to initialize the weights of the CNN being trained on the fused dataset. During the fine-tuning process, we initialize the weights of "conv1" to "conv5". Additionally, unlike some prior work, we do not freeze the lower layers of the CNN during fine-tuning. This enables our network to adapt to the data distribution of the fused data.
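The initialize-without-freezing strategy might look roughly like the following PyTorch sketch. The layer configuration only loosely mirrors conv1–conv5 in Table 1, the checkpoint path is hypothetical, and PyTorch merely stands in for whatever framework the authors actually used.

```python
import torch
import torch.nn as nn

# A hypothetical 5-conv-layer backbone loosely mirroring conv1-conv5 in Table 1;
# pooling/normalization layers are omitted and sizes are illustrative.
backbone = nn.Sequential(
    nn.Conv2d(3, 96, 7, stride=2), nn.ReLU(),
    nn.Conv2d(96, 256, 5, stride=2), nn.ReLU(),
    nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
    nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
    nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
)

# Initialize conv1-conv5 from ImageNet-pretrained weights (path is hypothetical).
pretrained = torch.load("vggm_imagenet_pretrained.pth")
backbone.load_state_dict(pretrained, strict=False)

# No layers are frozen: every parameter remains trainable so the whole
# backbone can adapt to the data distribution of the fused images.
for p in backbone.parameters():
    p.requires_grad = True
```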
Loss Functions
As shown in Table 1, the network has two output heads. The first classifies each ROI, outputting a discrete probability distribution over two categories (background and target). The second regresses the bounding box offsets of the ROI: for each category, it outputs a tuple $(t_x, t_y, t_w, t_h)$ whose elements indicate the shift relative to the central coordinates, height and width of the original proposal ROI.
Similar to [14], the following negative log likelihood objective is used for classifi-
cation:
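$$L_{cls}(p, u) = -\log p_u$$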
where $p$ is the predicted probability distribution over the categories and $u$ is the ground-truth class.
For regression, the smooth L1 loss function is used:
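$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\left(t_i^u - v_i\right), \qquad \mathrm{smooth}_{L_1}(z) = \begin{cases} 0.5 z^2 & \text{if } |z| < 1 \\ |z| - 0.5 & \text{otherwise} \end{cases}$$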
in which $t^u$ is the predicted bounding box offset for class $u$ and $v$ is the true offset.
In the training stage, the objective functions are combined as following:
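$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \,[u = 1]\, L_{loc}(t^u, v)$$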
The indicator $[u = 1]$ means that the bounding box regression term is trained only when the class is the target. The parameter $\lambda$ controls the balance between classification and regression; we set it to 1 in all experiments.
2.3. Experiments
2.3.1. Dataset
We evaluate our approach on a publicly released ATR database from the military sensing information analysis center (SENSIAC). This database package contains 207 GB of MWIR imagery and 106 GB of visible imagery along with ground truth data. All imagery was taken using commercial cameras operating in the MWIR and visible bands. The types of targets include people, foreign military vehicles, and civilian vehicles. The data were collected during both daytime and night, and the distance between cameras and targets varied from 500 to 5000 meters.
In our experiments, we only consider the vehicle objects, using 5 types of vehicles as training targets and 3 types of vehicles as testing targets. The details
of the split are shown in Figure 5. We selected each type of vehicle at 3 ranges of distance between the camera and the target (1000, 1500 and 3000 meters). The dataset contains 4573 training images and 2812 testing images, corresponding to every 5th frame sampled from the original dataset. It should be noted that we consider all the types of vehicle as one class, "vehicle". Consequently, the problem becomes a binary (vehicle and background) object detection problem.
2.3.3. Evaluation
Metrics:
For all metrics, we consider detections as true or false positives based on whether the area of overlap with the ground-truth bounding box exceeds 0.5. The overlap ratio can be calculated by the equation below:
$$a_o = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} \tag{5}$$
where $a_o$ denotes the overlap ratio, and $B_p$ and $B_{gt}$ denote the predicted bounding box and the ground-truth bounding box, respectively.
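A direct implementation of equation 5 takes only a few lines of Python; the (x1, y1, x2, y2) box convention is our assumption.

```python
def overlap_ratio(box_p, box_gt):
    """Equation 5: intersection-over-union of a predicted and a ground-truth box,
    each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / float(area_p + area_gt - inter)

# A detection counts as a true positive when overlap_ratio(...) exceeds 0.5.
```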
Mean Average Precision (mAP) is the gold-standard metric for evaluating the performance of an object detection algorithm: it first calculates the average precision
(AP) of each class and then averages all the obtained AP values. Because there is only
one class (vehicle) in our experiments, we select AP as one of the evaluation metrics.
The AP value can be easily obtained by computing the area under the precision-recall
curve.
Top1 Precision is a metric that is widely used in classification tasks, where the prob-
ability of multiple classes is predicted and the class having the highest score is selected.
Subsequently, the top1 precision score is calculated, which corresponds to the number of
predicted labels that match the target label, divided by the total number of data. In our
case, there is only one target in each image. Thus, we can employ top1 precision metric
in experiments to evaluate the performance of our framework in a practical scenario.
Figure 6. Average precision (AP) comparison between different experimental designs: independent inputs of the visible, MWIR and motion images; the fused visible and MWIR image (Visible-MWIR); the fused visible, MWIR and motion image (3-Channels); and decision-level fusion, respectively.
the feature map. The final output also confirms the reasoning that the enhanced feature
map can boost the performance.
2.4. Summary
In this work, an unsupervised learning based image fusion method is proposed and integrated into the ATD network, fusing visible, MWIR and motion information effectively. We further adopted a state-of-the-art generic object detector for battlefield object detection. We also utilized cross-domain transfer learning techniques to deal with insufficient data by first training the model on a large-scale visible image dataset and then
Figure 7. Examples of the visualised results of our framework. Examples 1 and 2 demonstrate the performance with varied inputs on large and small object detection, respectively. Different columns denote the different types of input image. The raw input image, generated feature map and final output are shown in consecutive rows. In the final output image, the green bounding box represents the position of the object predicted by the system.
fine-tuning on the small-scale fused image dataset. The proposed framework was evaluated on the SENSIAC dataset, achieving 98.34% average precision and 98.90% top-1 accuracy. However, the computational time is still too high for real-time applications; this remains a topic for our future work.
3. Case2-Face Recognition
3.1. Background
Face recognition in unconstrained images has remained one of the main research topics in computer vision for decades, due to its numerous practical applications in the areas of information security, access control, and surveillance systems.
There are two well-known paradigms of face recognition: identification and verification. In identification, a set of representations for specific individuals to be recognized is stored in a database. At test time, when a new image of a person is presented, the system checks whether that person is present in the stored database; if present, the identity of the test individual is retrieved. The task of a verification system, in contrast, is to analyze two face images and decide whether they represent the same person or two different people. Generally, neither of the images to be tested comes from any training dataset. Because face verification only needs to compare the two presented images, it
can yield results more quickly than an identification system. In this chapter, we focus on the face verification task.
Some of the challenges in the face recognition systems include variation in the ap-
pearance of the same person due to different poses, illuminations, expressions, ages, and
occlusions. Therefore, reducing the intra-personal variations while increasing the inter-
personal differences is a primary task in face recognition.
Most recently, deep learning has significantly improved the state of the art in many applications. In face recognition, deep models such as CNNs have proved effective for extracting high-level visual features, thus boosting recognition performance significantly [51][45].
In this chapter, we present a unified system for face recognition in unconstrained
conditions, which contains a face detection, face alignment and face verification module.
The overall system is inspired by the recent advances in deep learning community, such
as the Inception-ResNet architecture, center loss, Multi-Task CNNs, etc. In the stage of
experiments, we compared the presented method with the state-of-the-art on the widely
used Labeled Faces in the Wild (LFW) benchmark and showed its effectiveness.
A face recognition system typically includes three essential cascaded components: 1) face detection, 2) face alignment, and 3) face verification. In the sections below, we review the literature related to these components separately.
networks. The work in [61] uses the same network structure as [44] for multi-task facial landmark localization, named TCDCN. Shao et al. [38] adopt a deep network that learns deep representations from coarse to fine for face alignment.
3.3. Methods
streams. To be specific, there are three convolutional layers in the "trunk" stream, with a max pooling layer following the first convolutional layer. Each convolutional layer has the same 3 × 3 receptive field and a different number of filters (10, 16 and 32, respectively). In this network, the parametric rectified linear unit (PReLU) is applied as the non-linear activation function after the convolution layers. In the "head" streams, there is a 1 × 1 × 2 convolution for classification and a 1 × 1 × 4 convolution for estimating the bounding box regression vectors.
R-Net: All the candidates generated by the P-Net are fed to the refine network (R-Net), which further rejects a large number of false candidates. The architecture of the R-Net is slightly different from the P-Net: it also has three convolutional layers in the "trunk" stream, but there is an additional max pooling layer following the second convolutional layer and a fully connected layer of size 128 added at the end of the "trunk" stream. For the "head" streams, compared with the P-Net, they are replaced by two fully connected layers of size 2 and 4, respectively. Note that the input of the R-Net requires a size of 24 × 24.
O-Net: The final network in the cascaded CNNs framework is the output network
(O-Net), which aims to identify face regions with more supervision. In particular, the
network will output five facial landmarks. In the architecture of the O-Net, there are four convolutional layers of different sizes in the "trunk" stream, followed by a fully-connected layer of size 256. In the "head" streams, there are three fully connected layers used for classification, bounding box regression and facial landmark localization.
Training:
The three networks (P-Net, R-Net, and O-Net) are trained with multi-task losses: face
classification, bounding box regression, and facial landmark localization.
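Face Classification: The face/non-face decision is a two-class problem, for which the cross-entropy loss of the standard multi-task cascaded CNN formulation is used:
$$L_i^{det} = -\left(y_i^{det} \log p_i + \left(1 - y_i^{det}\right)\log\left(1 - p_i\right)\right)$$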
where $p_i$ is the probability produced by the network that indicates sample $x_i$ being a face, and $y_i^{det} \in \{0, 1\}$ denotes the ground-truth label.
Bounding Box Regression: For each candidate bounding box, the offset between it and the nearest ground truth is predicted. The learning objective is formulated as follows:
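$$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2$$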
where $\hat{y}_i^{box}$ is the regression target obtained from the network and $y_i^{box}$ is the ground-truth coordinate. There are four coordinates, including the left-top corner, height and width.
Facial Landmark Localization: Similar to bounding box regression task, facial
landmark detection is formulated as a regression problem:
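$$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2$$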
where $\hat{y}_i^{landmark}$ is the coordinate vector of the facial landmarks obtained from the network and $y_i^{landmark}$ is the ground-truth coordinate for the $i$th sample. Five facial landmarks are considered: left mouth corner, right mouth corner, nose, left eye, and right eye.
As shown in the face verification module in Figure 8, there are three main parts in the proposed face verification module: (1) deep feature extractor, (2) feature space embedding, and (3) feature distance measurement.
Deep Feature Extractor: Given an image x, we strive for a feature extractor g(x) which is able to transform x into a feature space $R^d$. Inspired by the Inception-ResNet [48], the networks are very deep and wide, in the sense that they comprise a long sequence of Inception modules in a ResNet fashion. Such CNNs have recently achieved state-of-the-art performance in the ImageNet classification challenge [35]. The large-scale structure of the Inception-ResNet architecture is given in Figure 10 and the detailed structure of its components is illustrated in Figures 11, 12 and 13. The structure of Inception-ResNet primarily includes a stem, three Inception clusters and two reduction modules. The stem component is a set of cascaded convolutions, which aims to extract low-level features. Moreover, each Inception cluster comprises several Inception blocks.
All the convolutional layers, including those inside the Inception-ResNet modules, use the rectified linear activation (ReLU). The input image size in our network is 128 × 128 with RGB color channels and mean subtraction. As shown in Figure 12, instead of the original Inception [49], the cheaper but effective Inception blocks are implemented in the Inception-ResNet. It should be noted that the last convolutional layer in each Inception block is a 1 × 1 filter-expansion layer, designed to increase the dimensions of the filter bank before the addition so as to match the depth of the input. In addition, each of the first two Inception clusters is followed by a reduction module, which is used to reduce the dimension of the feature maps.
Feature Space Embedding: The output of the deep feature extractor stage is a high-dimensional feature vector in the space $R^d$. For the face verification task, the deeply learned embedding needs to be not only discriminative but also compact. Thus an embedding function $f(x) \in R^e$ is needed, mapping the high-dimensional feature vector into a compact low-dimensional embedding. To this end, we follow the ideas in [36], using a fully connected layer and L2 normalization to directly learn a projection from the feature space $R^d$ to the embedding space $R^e$ (e is 128 in our experiments).
Feature Distance Measurement: The squared L2 distance in the compact embedding space directly corresponds to face similarity: faces of the same person have small distances and faces of distinct people have large distances. In this way, the face verification task becomes straightforward: it simply involves thresholding the distance between the two embeddings.
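The embedding and thresholding steps can be sketched in Python as follows; the threshold value is an illustrative assumption that would normally be tuned on a validation split.

```python
import numpy as np

def l2_embed(features):
    """Project feature vectors onto the unit hypersphere (L2 normalization),
    as done after the fully connected embedding layer."""
    return features / np.linalg.norm(features, axis=-1, keepdims=True)

def same_person(emb_a, emb_b, threshold=1.1):
    """Verification by thresholding the squared L2 distance between embeddings.
    The threshold here is illustrative, not a value from the chapter."""
    return np.sum((emb_a - emb_b) ** 2) < threshold
```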
Training:
Constructing a highly efficient loss function for discriminative feature learning in CNNs is non-trivial. In this application, we utilize the "center loss" proposed in [55] to efficiently endow the learned embedding with discriminative power. Specifically, a class center vector, which has the same dimension as the embedding feature, is learned. In the training stage, the distances between the embedding features and their corresponding center vectors are minimized; note that the center vectors are updated at the same time. We train the entire network under the joint supervision of the center loss and the softmax loss, with a hyperparameter balancing the two supervision signals. Basically, the center loss efficiently pulls the embedding features of each class towards their centers, while the softmax loss keeps the embedding features of different classes at large distances. Under this joint supervision, we can not only enlarge the inter-class feature variations but also decrease the intra-class feature differences. The center loss function is formulated as follows:
is formulated as follows:
1 m
Lc = ∑ xi − cyi 22
2 i=1
(9)
[Sub-figure captions: Inception-ResNet-A, Reduction-A, and Reduction-B module structures (cf. Figures 11-13).]
Here $c_{y_i} \in R^e$ denotes the $y_i$th class center of the embedding features. The formulation aims to reduce the intra-class variations. Furthermore, the joint center loss and softmax loss is formulated as follows:
formulated as follows:
T
m
eWyi xi +byi λ m
L = Ls + λ Lc = − ∑ log + ∑ xi − cyi 22 (10)
W T x +b 2
i=1 ∑nj=1 e j i j i=1
$L_s$ and $L_c$ indicate the softmax loss and the center loss, respectively, and the scalar $\lambda$ balances the two loss functions.
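A numpy sketch of equations 9 and 10 is given below; the λ value is illustrative, since the chapter only states that a hyperparameter balances the two signals, and in practice the centers would be updated alongside the network parameters.

```python
import numpy as np

def center_loss(embeddings, labels, centers):
    """Equation 9: half the sum of squared distances between each embedding
    and its class center."""
    diff = embeddings - centers[labels]
    return 0.5 * np.sum(diff ** 2)

def joint_loss(logits, embeddings, labels, centers, lam=0.003):
    """Equation 10: softmax loss plus lambda-weighted center loss.
    `logits` are the pre-softmax scores W^T x + b; lam is illustrative."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    softmax_loss = -log_probs[np.arange(len(labels)), labels].sum()
    return softmax_loss + lam * center_loss(embeddings, labels, centers)
```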
3.4. Experiments
3.4.1. Dataset
We use CASIA-Webface [56] as our training set and evaluate on the labeled faces in the
wild (LFW)[27] dataset which is the de-facto academic test set for face verification in
unconstrained conditions. The CASIA-Webface consists of 494,414 images of 10,575
subjects, and LFW contains 13,233 face images of 5749 identities collected from the
Internet. For comparison purposes, algorithms typically report the mean face verification
accuracy.
3.4.2. Performance
In Table 3, we compare the performance of our method with other state-of-the-art methods on the LFW face verification benchmark. We follow the standard protocol for unrestricted, labeled outside data. Even though it is well known that an ensemble of multiple models can slightly improve the overall performance, we only compare CNN-based single-model methods in order to focus on the power of the network design. Performance typically increases with the scale of the training dataset; with a much larger dataset, such as that of FaceNet (260M images of 8M subjects), the best performance on LFW can be achieved. In our experiments, our model is trained only on the public CASIA-Webface dataset. In comparison with the methods [57][56] that were also trained on CASIA-Webface, our method achieved state-of-the-art performance with 98.80 ± 0.005 accuracy. It should be noted that the human-level performance on LFW is 97.5% [26]; the presented system is able to surpass it by a large margin.
Besides the performance, another important factor is the time complexity. We evaluated the running time of each module (face detection with alignment, and face verification), as shown in Table 4. The application was implemented using the TensorFlow deep learning tool [1] and run on a machine with an NVIDIA GeForce GTX 1080 GPU, an Intel Core i7 CPU and 32 GB of memory. In the testing stage, two images of size 250 × 250 were fed into the application. As shown in Table 4, it takes 1.092 seconds for face detection and alignment and 1.067 seconds for face verification, so in total
the Deep Residual Face application takes around 2.159 seconds. To meet the real-time requirement in a real scenario, we need to do more work in the future on improving the speed of our face recognition application.
3.5. Summary
In this case study, a CNN-based face recognition system is proposed to show the effectiveness of deep learning in security applications. We adopted the state-of-the-art face detector, Multi-Task CNNs, to perform face patch detection and alignment. Furthermore, the recently proposed Inception-ResNet CNN architecture was utilized for deep feature extraction. With the help of the L2 embedding procedure and the center loss training trick, the proposed face recognition system achieved competitive results on the well-known LFW benchmark.
4. Conclusion
In this chapter, two cases, military automatic target detection and face recognition, were
presented to show the effectiveness of using deep learning for surveillance and security
applications. In the first case, a CNN-based military ATD method using a new deep image
fusion technique was proposed, which fused multi-channel images within a CNN to en-
hance the significance of deep features and achieved promising results on the SENSIAC
dataset. In the second case, we also presented a novel face recognition system motivated
by the recent advances in the deep learning community, where the Inception-ResNet
CNN architecture was utilized to extracting deep features and the center loss function
was adopted for training the face verification network. Our novel face recognition system
is able to surpass human-level performance on the face verification benchmark LFW.
References
[1] Martn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Cor-
rado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp,
Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur,
Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster,
Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay
Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu,
Xiaoqiang Zheng, and Google Research. TensorFlow: Large-Scale Machine Learning on Heterogeneous
Distributed Systems.
[2] Melih S. Aslan, Zeyad Hailat, Tarik K. Alafif, and Xue-Wen Chen. Multi-channel multi-model feature
learning for face recognition. Pattern Recognition Letters, 85:79–83, 2017.
[3] Gaurav Bhatnagar and Zheng Liu. A novel image fusion framework for night-vision navigation and
surveillance. Signal, Image and Video Processing, 2015.
[4] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[5] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the Devil in the
Details: Delving Deep into Convolutional Nets. arXiv preprint arXiv: . . . , pages 1–11, 2014.
[6] Bo-hao Chen and Shih-chia Huang. Probabilistic neural networks based moving vehicles extraction
algorithm for intelligent traffic surveillance systems. Information Sciences, 299:283–295, 2015.
[7] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a Similarity Metric Discriminatively, with Application to Face Verification.
[8] Timothy F Cootes, Gareth J Edwards, and Christopher J Taylor. Active Appearance Models.
[9] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, pages 886–893, 2005.
[10] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88:303–338, 2010.
[11] Pedro F Felzenszwalb, Ross B Girshick, David Mcallester, and Deva Ramanan. Object Detection with
Discriminatively Trained Part Based Models. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 32(9):1–20, 2009.
[12] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient Graph-Based Image Segmentation. Interna-
tional Journal of Computer Vision, 59(2):167–181, 2004.
[13] Francisco Javier López-Rubio and Ezequiel López-Rubio. Foreground detection for moving cameras with stochastic approximation. Pattern Recognition Letters, 68:161–168, 2015.
[14] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision,
2016.
[15] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate
object detection and semantic segmentation.
[16] Jing-Ming Guo, Chih-Hsien Hsia, and Yun-Fu Liu. Fast Background Subtraction Based on a Multilayer Codebook Model for Moving Object Detection. 23(10):1809–1821, 2013.
[17] Ju Han and Bir Bhanu. Fusion of color and infrared video for moving human detection. Pattern Recog-
nition, 40:1771–1784, 2007.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recog-
nition. Arxiv.Org, 7(3):171–180, 2015.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial Pyramid Pooling in Deep Convo-
lutional Networks for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 37(9):1904–1916, 2015.
[20] Wu-Chih Hu, Chao-Ho Chen, Tsong-Yi Chen, Deng-Yuan Huang, and Zong-Che Wu. Moving object
detection and tracking from video captured by moving camera. 2015.
[21] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio
Guadarrama, and Trevor Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding *.
[22] Huaizu Jiang and Erik Learned-Miller. Face Detection with the Faster R-CNN.
[23] Michael Kass and Andrew Witkin. Snakes: Active Contour Models. International Journal of Computer
Vision, pages 321–331, 1988.
[24] Scale-invariant Keypoints and David G Lowe. Distinctive Image Features from. International Journal
of Computer Vision, 60(2):91–110, 2004.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolu-
tional Neural Networks. Advances In Neural Information Processing Systems, pages 1–9, 2012.
[26] Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. Attribute and Simile Classi-
fiers for Face Verification.
[27] Erik Learned-Miller, Gary Huang, Aruni Roychowdhury, Haoxiang Li, Gang Hua, and Gary B Huang.
Labeled Faces in the Wild: A Survey.
[28] Yunzhu Li, Benyuan Sun, Tianfu Wu, and Yizhou Wang. Face Detection with End-to-End Integration
of a ConvNet and a 3D Model.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro
Perona, Deva Ramanan, C Lawrence Zitnick, and Piotr Dolı́. Microsoft COCO: Common Objects in
Context.
[30] Etai Littwin and Lior Wolf. The Multiverse Loss for Robust Transfer Learning.
[31] Yifeng Niu, Shengtao Xu, Lizhen Wu, and Weidong Hu. Airborne infrared and visible image fusion
for target perception based on target region segmentation and discrete wavelet transform. Mathematical
242 S. Liu et al. / On the Prospects of Using Deep Learning for Surveillance and Security Applications
[60] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint Face Detection and Alignment using
Multi-task Cascaded Convolutional Networks. IEEE Signal Processing Letters, PP(99):1499–1503,
2016.
[61] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. LNCS 8694 - Facial Landmark
Detection by Deep Multi-task Learning. 2014.
Deep Learning for Image Processing Applications
D.J. Hemanth and V.V. Estrela (Eds.)
IOS Press, 2017
© 2017 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-822-8-244
Super-Resolution of Long Range Captured Iris Image Using Deep Convolutional Network
Anand DESHPANDE a and Prashant P. PATAVARDHAN b
a Department of Electronics & Communication Engineering, Angadi Institute of Technology and Management, Belgaum, India / Research Scholar, Department of Electronics and Communication Engineering, Gogte Institute of Technology, Belgaum, India. E-mail: deshpande.anandb@gmail.com
b Department of Electronics and Communication Engineering, Gogte Institute of Technology, Belgaum, India.
Abstract. This chapter proposes a deep convolutional neural network based super-resolution framework to super-resolve and recognize long-range captured iris image sequences. The proposed framework is tested on the CASIA V4 iris database by analyzing the peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM) and visual information fidelity in pixel domain (VIFP) against state-of-the-art algorithms. The performance of the proposed framework is analyzed for upsampling factors 2 and 4, achieving PSNRs of 37.42 dB and 34.74 dB respectively. Using this framework, an equal error rate (EER) of 0.14% is achieved. The results demonstrate that the proposed framework can super-resolve iris images effectively and achieves better recognition performance.
1. Introduction
The identification of people when they use their credit cards, pass through airports or other high-security areas, or log into computers, to name only a few situations, is paramount today. In some cases user credentials such as a username and password are used, whereas in other cases personal identification cards are used. The problem is that usernames and passwords can be forgotten, while identity cards can be stolen. Improving human identification therefore requires new techniques that are more reliable and accurate than the traditional ones.
1 Anand Deshpande, Research Scholar, Department of Electronics and Communication Engineering, Gogte Institute of Technology, Belgaum. E-mail: deshpande.anandb@gmail.com
Biometrics is one of the areas in science and technology that demands more research. Biometrics uses individual behavioral and physiological characteristics to authenticate people through an automated method. Iris biometrics, considered one of the most robust and accurate recognition methods, is widely employed at airports for access control, general security, border vigilance, etc. Many researchers are working towards recognition of iris images captured from a long distance, approximately 3 meters.
Most conventional short-distance iris recognition follows a stop-and-stare protocol, which is inconvenient for the participants. According to the International Standards Organization (ISO), an iris diameter of 200 pixels is required for recognition; however, this resolution is only attainable at distances up to a few centimeters. The major problems that short-distance iris recognition systems face are:
a) close acquisition distance;
b) delay in acquisition time, which leads to low throughput; and
c) the participants must stand still during acquisition.
New techniques therefore focus on overcoming these limitations so that recognition can function at a distance. The resolution of the captured iris image plays a vital, and problematic, role.
Long distances present additional challenges. Changes in brightness cause dilation and contraction of the iris, which degrades the recognition procedure, and the images can become unusable under poor illumination. Other challenging tasks for iris image acquisition are tracking, deciding when to capture, and capturing the iris depending on where the person is. An iris acquisition system is illustrated in Figure 1.
Figure 1. An iris-at-a-distance and on-the-move recognition system [1]
Most remote iris recognition systems are slight variations of the framework from
Figure 1 and have five major components:
• Camera for human detection and tracking with a low-resolution (LR) wide field of view (FOV).
• Camera for face detection and tracking with an LR wide FOV.
• Camera for capturing irises with a high-resolution (HR) narrow FOV.
• Pan-tilt head for simultaneously following the face and the iris.
• Controllers for illumination.
The first long-range iris recognition system was proposed in [2]. Using an infrared camera and a telescope, the system could image the iris at distances between 5 and 10 meters. Illumination was provided by an 880 nm collimated illuminator. The system could acquire iris images with 128 pixels of resolution across a typical iris diameter, which is good enough for recognition purposes. However, the constraints imposed on the participants in this system were (i) a fixed position of the head, and (ii) staring at a fixed spot. Other researchers have tried to relax these constraints by increasing the vertical acquisition range and the depth of field. The remote iris recognition system must therefore be carefully designed to achieve the required resolution. Images captured at long range suffer from LR, which leads to less information or loss of detail.
The loss of pixel resolution reduces the iris recognition system's performance. To overcome this issue, super-resolution (SR) techniques are used. The analysis of still images and video involves huge amounts of data and rich information, as they are the main carriers of perceived data. The acquisition of images and video has become very convenient due to the development of smart mobile phones. Most visual information has to be acquired and stored in LR because of bandwidth and storage limitations, yet an HR version of the scene has to be reconstructed from LR images to display the scene on an HD screen.
The size and number of the sensors determine the resolution of the captured image. Due to constraints on sensor resolution and functionality, an HR image is not always available [3], and shot noise further degrades image quality. Another approach to enhancing the spatial resolution is to increase the chip size, which in turn increases the capacitance. This method is not considered effective, since large capacitances slow down the charge transfer rate [4].
SR can remedy the problems discussed above. However, since it is a computationally intensive technique, deep neural networks arise as an option to super-resolve and recognize long-range captured iris images efficiently. This chapter is structured as follows. Section 2 discusses the SR framework applied to iris recognition. Section 3 presents a deep neural network implementation for the long-distance iris recognition problem. Section 4 discusses the results obtained using the proposed system. Finally, the conclusion is given in Section 5.
2. Super-Resolution Framework
To overcome the intrinsic limitations of imaging systems and sensors, new approaches to increasing spatial resolution are required. In many commercial applications, the high cost of precision optics and image sensors is an important factor. A promising approach is to construct an HR image from one or more LR observations by means of image processing. There is a natural loss of spatial resolution when capturing an image with a digital system, caused by optical distortion, lack of adequate focus, motion blur due to limited shutter speed or relative movement between the camera and the object, diffraction limits, aliasing, sensor noise,
transmission problems and insufficient sensor density. Aliasing distorts the high-frequency content of the image when it is sampled at low rates. Improving resolution by applying image processing tools has been a topic of great interest. High spatial resolution images can be obtained from one or more LR images with SR, which must recover the high-frequency content lost during the image acquisition process. The main concern of an SR algorithm is to reconstruct HR images from under-sampled LR observations, producing high-quality images from blurred, noisy and degraded inputs. The word 'super' underlines that the technique overcomes the inherent resolution limitation of LR imaging systems. Some of the advantages of SR are: i) low cost, ii) existing LR systems can be used without any additional hardware, and iii) flexibility.
It is necessary to have a detailed understanding of how images are captured and the transformations they undergo in order to apply an SR algorithm. As said before, the HR image goes through a sequence of degradations such as blur, additive noise and down-sampling during image acquisition. Due to camera motion such as zooming, tilting and panning, frames captured with a video camera can be rotated and scaled, and the relative motion between observations may introduce blur. Hence, the observed images are degraded versions of the HR image. A mathematical model that represents the image acquisition process has to be formulated to analyze the SR reconstruction problem.
This model, known as the forward or observation model, relates the original HR image to the observed LR image(s). Correct formulation of the observation model plays an important role in the success of any SR approach; its standard components are translation, blur, aliasing and noise. Figure 2 illustrates the typical SR model.
Figure 2. The observation model relating the HR scene to the kth LR observation Yk through registration (motion), blur, down-sampling and additive noise.
Here, X is the desired HR image, which is band-limited and sampled above the Nyquist rate, and Yk is the kth observed LR output. Let the camera capture K LR images of X, where the LR observations are related to the HR scene [5] by

Yk = D Bk Mk X + Nk,  k = 1, ..., K   (1)

where D is the down-sampling operator, Bk is the blur for the kth LR image, Mk contains the motion information, and Nk is the noise in the kth LR image. Upper-case bold letters denote the lexicographically ordered vector forms of the HR and LR images respectively, and lower-case letters denote vectors.
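To make the observation model concrete, the following Python sketch simulates LR observations from an HR image under an assumed Gaussian blur, a sub-pixel translation and additive Gaussian noise; the function name degrade and all parameter values are illustrative and not taken from the chapter.

```python
# A minimal sketch (not the authors' code) of the forward model Yk = D Bk Mk X + Nk.
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def degrade(hr, dx=1.5, dy=-0.7, blur_sigma=1.0, factor=2, noise_std=2.0):
    """Produce one LR observation from an HR grayscale image (float array)."""
    warped = shift(hr, (dy, dx), order=1, mode='nearest')    # Mk: sub-pixel motion
    blurred = gaussian_filter(warped, sigma=blur_sigma)      # Bk: optical/motion blur
    lr = blurred[::factor, ::factor]                         # D : down-sampling
    return lr + np.random.normal(0.0, noise_std, lr.shape)   # Nk: sensor noise

hr = np.random.rand(128, 128) * 255.0          # stand-in for the HR iris image X
lr_obs = [degrade(hr, dx=np.random.uniform(-2, 2), dy=np.random.uniform(-2, 2))
          for _ in range(4)]                   # K = 4 simulated LR observations
```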
The existing SR algorithms are classified as interpolation, reconstruction, and
example learning based methods. Interpolation-based SR methods interpolate the HR
image from the LR input using a smooth kernel function [6] [7] [8] [9] [10]. These methods are fast, but fail to recover high-frequency details and tend to introduce zigzag artifacts. Reconstruction-based methods [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] solve the ill-posed inverse problem of up-sampling, deblurring and denoising using smoothness priors to obtain a high-quality image; however, these methods produce watercolor-like artifacts, and image quality degrades as the magnification factor increases. Because the richness of images is difficult to capture analytically, researchers are exploring learning-based approaches for super-resolving images [21] [22] [23] [24]. This method consists of two phases:
• Training: a dictionary of image patches is created.
• SR: the input image patches are up-sampled using the dictionary.
Two types of dictionaries can be created: external and internal. An external dictionary is built in advance using a set of external training images, whereas an internal dictionary is built without using other images. The internal dictionary exploits the property of self-similarity, in which image structures repeat within and across scales of the same image.
Learning-based SR approaches use priors learned from examples to infer the unknown HR image [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35]. The input LR image is divided into overlapping or non-overlapping patches; for each patch the best match is selected from the training database, and the output HR image is reconstructed from the corresponding HR patches.
Deep learning approaches have been successfully applied to image restoration problems, and various deep neural network based methods have been proposed for image SR, attaining better performance than other models. The authors of [36] proposed a single-image SR method based on a deep convolutional neural network, which learns an end-to-end mapping between LR and HR images. They also showed that sparse-coding based SR methods can be viewed as a deep convolutional network. The structure is lightweight and fast enough for real-time use; the authors explored numerous network structures and parameter settings, analyzed performance and speed, and later extended the work to handle the red, green and blue color channels, showing better reconstruction quality. The authors of [37] proposed a single-image SR method based on deep learning and gradient transformation: the LR image is upsampled by a convolutional neural network, the gradients of the up-scaled image are computed and transformed to the required gradients by a gradient transformation network, which establishes the reconstruction energy function; finally, the energy function is optimized to estimate the HR image. This approach produces sharp HR images, but with ringing artifacts. The authors of [38] proposed a method for light-field image SR using a deep convolutional neural network and analyzed it for real-world applications such as refocusing and depth map estimation. The authors of [39] showed that the sparse coding model can be cast as a neural network and trained end-to-end in a cascaded structure. The authors of [40] proposed a super-resolution method with multi-channel convolutional neural networks, in which output pixels are classified into four groups depending on their positions and the groups are assembled into a 2×2 magnified image; this architecture does not rely on bicubic interpolation. Using this approach, the average PSNR is 0.39 dB higher than that of the conventional SRCNN.
Table 1. Overview of iris image databases
Database    Size     Camera                                   Light Wavelength    Distance
CASIA V1    756      CASIA                                    NIR                 Short
CASIA V2    225      CASIA                                    NIR                 Short
CASIA V3    22051    OKI iris pass-h                          NIR                 Short
CASIA V4    2576     CASIA                                    NIR                 Long
MMU1        450      LG EOU2200                               NIR                 Short
MMU2        995      Panasonic BM ET 100 US                   NIR                 Short
BATH        16000    ISG LW 1.3 S 1394                        NIR                 Short
WVU         3099     OKI iris pass-h                          NIR                 Short
MBGC        Video    MBGC-Portal                              NIR                 Long
ICE1        2900     LG EOU2200                               NIR                 Short
ICE2        75000    LG EOU2200                               NIR                 Short
UBIRIS v1   1877     NIKON E5700                              Visible             Short
UBIRIS v2   11357    Canon EOS 5D                             Visible             Long
UPOL        384      Sony DXC 950P 3CCD with TOPCON TRC501A   Visible             Short
QFIRE05     Video    Dalsa 4M30                               NIR                 Long
QFIRE07     Video    infrared camera                          NIR                 Long
QFIRE11     Video    Dalsa 4M30                               NIR                 Long
Table 1 shows that the CASIA V4 dataset is the most suitable for the proposed research objectives, as it contains NIR images captured at a long distance.
3. Proposed System
Most SR approaches are directed towards remote sensing applications, and comparatively little work has been carried out on biometrics using convolutional neural network based SR. In this work, a convolutional neural network based SR method is used to recognize long-range captured iris images. The proposed system to fuse and super-resolve the iris image is shown in Figure 3.
Figure 3. Block diagram of the proposed system: eye and iris extraction, normalization, super-resolution, feature extraction and recognition.
3.1. Iris Extraction
In this block, the AdaBoost [52] [53] approach is used to extract the face and the eye using face and eye-pair classifiers. This method has low complexity and gives high detection accuracy at high speed, due to the use of Haar features and a cascaded classifier structure, which rejects non-target image windows rapidly. Off-angle and blinking images are detected by correlating the image with an averaged image [54]. The correlation coefficient r is given by

r = Σm Σn (Amn − Ā)(Bmn − B̄) / sqrt( (Σm Σn (Amn − Ā)²) (Σm Σn (Bmn − B̄)²) )   (2)

where A is the candidate eye image, B is the averaged image, and Ā and B̄ are their respective means.
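The extraction step can be prototyped with OpenCV's pre-trained Haar cascades; the sketch below is a hedged illustration of the Viola-Jones/AdaBoost idea and does not use the authors' trained face and eye-pair classifiers (the file name subject.png is hypothetical).

```python
# Face and eye detection with OpenCV's bundled Haar cascades (illustrative settings).
import cv2

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_eye.xml')

img = cv2.imread('subject.png')                      # hypothetical input frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    face_roi = gray[y:y + h, x:x + w]                # restrict the eye search to the face
    for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(face_roi, scaleFactor=1.1, minNeighbors=5):
        eye_patch = face_roi[ey:ey + eh, ex:ex + ew] # candidate eye region for iris extraction
```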
Specular reflections may affect the segmentation of iris images. They can be minimized or removed as follows (a code sketch is given after this list):
• Find the positions of the specular reflections by thresholding.
• Fill the reflected regions by in-painting.
• Apply a median filter to smooth the result and remove the boundary artifacts introduced by the in-painting.
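A minimal sketch of these three steps, assuming OpenCV, follows; the brightness threshold and filter sizes are assumptions, not values given in the chapter.

```python
# Specular-reflection removal: threshold bright spots, in-paint, then median-filter.
import cv2
import numpy as np

eye = cv2.imread('eye_patch.png', cv2.IMREAD_GRAYSCALE)    # hypothetical eye image

_, mask = cv2.threshold(eye, 230, 255, cv2.THRESH_BINARY)  # highlights are near-saturated
mask = cv2.dilate(mask, np.ones((3, 3), np.uint8))         # cover the reflection borders
filled = cv2.inpaint(eye, mask, 3, cv2.INPAINT_TELEA)      # fill the reflected regions
clean = cv2.medianBlur(filled, 3)                          # smooth in-painting artifacts
```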
After removing the specular reflections, the iris region is extracted as follows:
• Eyebrows and eyelashes are removed by thresholding, as they have low intensity.
• The pupil is obtained by using geometric properties and extracting the connected component with the maximum height and width.
• Horizontal lines are drawn from the center of the pupil outwards towards the sclera region.
• The change in intensity along each line is checked; the point where the intensity increases quickly is where the line hits the iris boundary, as the sclera appears brighter than the iris.
• The iris center is the meeting point of the perpendicular bisectors of the segments joining pairs of those boundary points.
The dimension of the iris is not constant due to rotation of the camera or eyes, change
in capturing distance, etc. The iris is transformed into polar form [55] to get the same
dimension for all the images, such that two images of the same iris captured under
various conditions will have characteristic features at the same spatial location.
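The polar normalization can be sketched as a simplified Daugman rubber-sheet mapping; the implementation below assumes concentric pupil and iris boundaries and illustrative output dimensions (40 radial by 300 angular samples), which is a simplification of the method in [55].

```python
# Unwrap the iris annulus into a fixed-size polar image (nearest-pixel sampling).
import numpy as np

def to_polar(iris_img, pupil_xy, pupil_r, iris_r, radial_res=40, angular_res=300):
    theta = np.linspace(0.0, 2.0 * np.pi, angular_res, endpoint=False)
    r = np.linspace(0.0, 1.0, radial_res)
    rr, tt = np.meshgrid(r, theta, indexing='ij')
    cx, cy = pupil_xy
    # Interpolate radially between the pupil boundary and the iris (limbus) boundary
    x = cx + (pupil_r + rr * (iris_r - pupil_r)) * np.cos(tt)
    y = cy + (pupil_r + rr * (iris_r - pupil_r)) * np.sin(tt)
    xi = np.clip(np.rint(x).astype(int), 0, iris_img.shape[1] - 1)
    yi = np.clip(np.rint(y).astype(int), 0, iris_img.shape[0] - 1)
    return iris_img[yi, xi]          # shape: (radial_res, angular_res)
```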
3.2. Super-Resolution
The single iris polar image is super-resolved using a convolutional neural network as discussed in [36]. An overview of the convolutional neural network is shown in Figure 4.
The super-resolution of the iris polar image is carried out in the following steps.
a) First layer: this layer is called the patch extraction and representation layer. A popular strategy for image reconstruction is to extract patches and represent them by a set of pre-trained bases; this is equivalent to convolving the image with a set of filters, each of which corresponds to a basis. In this layer, overlapping image patches are extracted and represented by a set of pre-trained discrete cosine transform bases. Let the LR iris polar image be up-sampled to the required size by bicubic interpolation to obtain Y. The goal is to recover from Y an image F(Y) which is as similar as possible to the ground-truth high-resolution image X. This layer is represented as

F1(Y) = max(0, W1 * Y + B1)

where W1 and B1 denote the filters and biases of the first layer and '*' denotes convolution.
Stochastic gradient descent with standard back-propagation [45] is used to minimize the loss.
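A hedged PyTorch sketch of an SRCNN-style network of the kind described in [36] is given below; the 9-1-5 filter sizes and channel counts follow that paper's defaults and are not necessarily the exact configuration used in this chapter.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_extraction = nn.Conv2d(1, 64, kernel_size=9, padding=4)  # layer a): F1
        self.nonlinear_mapping = nn.Conv2d(64, 32, kernel_size=1)           # map features to HR patch space
        self.reconstruction = nn.Conv2d(32, 1, kernel_size=5, padding=2)    # aggregate into the HR image
        self.relu = nn.ReLU(inplace=True)

    def forward(self, y):          # y: bicubic-upsampled LR polar image, shape (N, 1, H, W)
        f1 = self.relu(self.patch_extraction(y))
        f2 = self.relu(self.nonlinear_mapping(f1))
        return self.reconstruction(f2)

model = SRCNN()
criterion = nn.MSELoss()                                   # loss minimized by SGD with back-propagation
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```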
In iris recognition, feature extraction and feature selection play a major role in enhancing recognition accuracy. This part proposes a feature extraction method using a discrete cosine transform (DCT) domain based no-reference image quality analysis (IQA) model, the Gray Level Co-occurrence Matrix (GLCM), Hu's seven moments and statistical features. It also proposes a feature selection method based on fuzzy entropy and interval-valued fuzzy set measures. The block parametric model for DCT-domain feature extraction is shown in Figure 5. The features of the iris polar images are extracted in the DCT domain using the parameters of the model. The reasons for extracting features in the DCT domain are: 1) DCT coefficients vary with the degree and type of image distortion; and 2) ease of computation, since DCTs can be computed efficiently [51] [56] [57] by variable-change transforms from computationally efficient fast Fourier transform algorithms. Applying the model-based method to the coefficients increases the computational efficiency.
Figure 5. Block parametric model for DCT-domain feature extraction: DCT computation, generalized Gaussian modeling and model-based feature extraction.
In the first stage, the image is subjected to DCT coefficient computation: the image is partitioned into n × n blocks (patches) and DCT coefficients are obtained for each patch. In the second stage, a generalized Gaussian density model is applied to the DCT coefficients of each block. The DC coefficient of a block is ignored, as it does not express structural information about the block. The generalized Gaussian model is given by

f(x | γ, μ, σ) = α exp( −(β |x − μ|)^γ )   (7)

where γ is the shape parameter, μ is the mean, and α and β are the normalizing and scale parameters given by

α = β γ / (2 Γ(1/γ))   (8)

β = (1/σ) sqrt( Γ(3/γ) / Γ(1/γ) )   (9)

This family of distributions includes the Gaussian distribution (γ = 2) and the Laplacian distribution (γ = 1); as γ → ∞, the distribution converges to a uniform distribution. Using the generalized Gaussian model, four image features are obtained in the third stage, as described below; a code sketch of this computation follows the feature descriptions.
• Shape Parameter
The shape parameter (γ) is computed over all blocks in the image. It determines the rate of decay of the distribution's tails.
• Coefficient of Frequency Variation
Let X be a random variable representing the histogram of DCT coefficients. The coefficient of frequency variation ξ = σ|X| / μ|X| is the ratio of the spread of the DCT coefficient magnitudes (σ|X|) to the center of the DCT coefficient magnitude distribution (μ|X|). The average of ξ over all blocks is used as the feature.
• Energy Sub-band Ratio
This feature measures the distribution of energy in the lower and higher bands, which can be affected by distortions. The 5 × 5 DCT coefficient matrix of a block is divided into three sub-bands, as shown in Figure 6. With En denoting the energy (variance of the coefficients) of band n, the ratio for band n is

Rn = | En − mean(Ej, j < n) | / ( En + mean(Ej, j < n) )   (11)
Figure 6. Partition of the 5 × 5 block DCT coefficients (excluding the DC term) into three frequency bands, from low to high frequency.
where Band 1, Band 2 and Band 3 denote the low-, mid- and high-frequency groups of coefficients. The energy sub-band ratio is large when the frequency content of the image patch is uniform, and low when there is only a small disparity between the energy of the feature band and the average energy of the lower bands. This feature is calculated for all blocks in the image and the mean over blocks is used.
• Orientation Model
Here, the DCT coefficients of each block are divided into three orientation bands, shown in three different shades in Figure 7, and each band is modeled separately. A generalized Gaussian model is fitted to each orientation band in a block and the coefficient of frequency variation ξ is calculated; the variance and mean of ξ are then computed for each of the three orientations.
Figure 7. The DCT coefficients of a 5 × 5 block grouped into three orientation bands.
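The sketch below illustrates the block-DCT feature computation (coefficient of frequency variation and energy sub-band ratio); the band split and normalization are illustrative assumptions, since the chapter does not list them explicitly.

```python
# Block-DCT features: mean coefficient of frequency variation and mean sub-band ratio.
import numpy as np
from scipy.fft import dctn

def block_dct_features(img, block=5):
    xis, ratios = [], []
    h, w = img.shape
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            c = dctn(img[i:i + block, j:j + block], norm='ortho')
            ac = np.abs(c).ravel()[1:]                  # drop the DC coefficient
            xis.append(ac.std() / (ac.mean() + 1e-12))  # coefficient of frequency variation
            r = np.add.outer(np.arange(block), np.arange(block))   # radial index i + j
            e = [np.var(c[(r > lo) & (r <= hi)]) for lo, hi in [(0, 2), (2, 5), (5, 8)]]
            mean_lower = np.mean(e[:2])
            ratios.append(abs(e[2] - mean_lower) / (e[2] + mean_lower + 1e-12))
    return np.mean(xis), np.mean(ratios)

xi_mean, ratio_mean = block_dct_features(np.random.rand(40, 300))   # stand-in polar image
```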
Fourteen GLCM features [58], Hu's seven moments [59] and a set of statistical features (mean, median, standard deviation, variance, skewness, kurtosis and entropy) complete the feature set of the iris polar images. These features were selected empirically for the proposed work because they require little computation time. The GLCM method extracts second-order statistical texture features: it describes the image texture by counting how often pairs of pixels with a specific value and a
specific spatial relationship occur in the image. The number of rows and columns of this matrix equals the number of gray levels in the image. The features extracted from the GLCM are listed below; a code sketch follows the list.
• Contrast

Contrast = Σi Σj (i − j)² p(i, j)   (13)

where p(i, j) is the relative frequency with which two pixels separated by a distance (Δx, Δy) occur within a given region, with intensities i and j.
• Inverse Difference Moment

IDM = Σi Σj p(i, j) / (1 + (i − j)²)   (14)

It reaches its maximum value when all elements of the matrix are equal; GLCM contrast and homogeneity are inversely related.
• Entropy

Entropy = −Σi Σj p(i, j) log p(i, j)   (16)

• Sum Average

Sum average = Σi i · p(x+y)(i)   (18)

• Sum of Entropy
• Difference of Entropy
• Cluster Shade

Shade = Σi Σj (i + j − μx − μy)³ p(i, j)   (21)

• Cluster Prominence

Prominence = Σi Σj (i + j − μx − μy)⁴ p(i, j)   (22)

• Autocorrelation
• Dissimilarity

Dissimilarity = Σi Σj |i − j| p(i, j)   (26)
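GLCM features of this kind can be computed with scikit-image, as in the hedged sketch below; the distance, angle and gray-level settings are illustrative rather than the chapter's exact configuration.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(polar_img):
    """polar_img: 8-bit grayscale iris polar image."""
    q = (polar_img // 8).astype(np.uint8)                      # quantize to 32 gray levels
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi / 2],
                        levels=32, symmetric=True, normed=True)
    feats = {p: graycoprops(glcm, p).mean()
             for p in ('contrast', 'homogeneity', 'energy', 'correlation', 'dissimilarity')}
    p = glcm.mean(axis=(2, 3))                                 # average over distances/angles
    feats['entropy'] = -np.sum(p * np.log2(p + 1e-12))
    return feats

feats = glcm_features(np.random.randint(0, 256, (40, 300), dtype=np.uint8))
```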
A feature selection method based on fuzzy entropy and interval-valued fuzzy sets is proposed to select the best features and thereby increase the recognition performance. Algorithms for feature selection based on fuzzy entropy and interval-valued fuzzy sets are discussed in [51] [60] [61]. Entropy measures the amount of disorder or uncertainty of a random variable. A major step when applying entropy is the discretization of the continuous range of each variable; as a result, several observations fall close to the boundaries between adjacent intervals [62] [63], which leads to erroneous selections during discretization and degrades the performance of the selected feature set. Fuzzy entropy and interval-valued fuzzy sets can be applied to overcome this issue: observations close to the boundaries between intervals are treated as belonging to both intervals, which increases the information extracted from the chosen features.
Each feature value is assigned a membership degree in each fuzzy interval through a triangular membership function (Eq. (27)), so that values near an interval boundary partially belong to both neighboring intervals. The fuzzy entropy of a feature is then computed as

H = −Σi pi log pi   (31)

where pi is the membership-weighted probability that an observation falls in the ith fuzzy interval.
After determining the best possible number of fuzzy numbers for each feature, the entropy of each feature is calculated with respect to the outcome variable, and the feature with the optimal entropy is included in the final feature subset. New features are appended to this set one by one in the succeeding steps of the algorithm: the next feature is determined by calculating the joint entropy of the previously chosen features together with each feature not yet selected. The following equations describe the joint entropy calculation when selecting the second feature.
The joint membership of an observation in the fuzzy intervals of two features is obtained from the product of its membership degrees, giving joint probabilities p(i, j) for each pair of fuzzy intervals (Eqs. (32)-(33)), and the joint entropy of the two features is

H(Fa, Fb) = −Σi Σj p(i, j) log p(i, j)   (34)
The pair of features having optimal entropy is selected as the best pair. If the improvement in the information gained exceeds a predefined value, the features are included.
3.6. Recognition
The extracted iris IQA features, GLCM features and Hu moments are fed to a neural network (NN) to recognize the iris image. Back-propagation is a widely applied training scheme because of the capability of back-propagation networks to learn complex multidimensional mappings [51]. The architecture consists of fully interconnected layers of processing units [65]. The features form the input signal of the neural network: x1, x2, ..., xm are the input feature values and y1, y2, ..., yn are the output iris patterns. The NN model for output l is

yl = f( Σj wlj(3) · f( Σk wjk(2) · f( Σi wki(1) · xi ) ) )   (35)

where f is the activation function, wlj(3) are the weights between the output and the second hidden layer, wjk(2) are the weights between the two hidden layers, and wki(1) are the weights between the input and the first hidden layer, with l = 1, 2, ..., n. Here h1 and h2 are the numbers of neurons in the first and second hidden layers (over which the sums on k and j run), n is the number of neurons in the output layer and m is the number of neurons in the input layer, as shown in Figure 8.
Figure 8. Architecture of the fully connected neural network with inputs x1, ..., xm, two hidden layers and outputs y1, ..., yn.
The parameters of the NN are trained, and the trained network is then used for iris recognition. In the proposed work, a gradient-based learning algorithm with an adaptive learning rate is adopted.
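A hedged sketch of the recognition stage follows, using scikit-learn's MLPClassifier as a stand-in for the two-hidden-layer back-propagation network with an adaptive learning rate; the hidden-layer sizes and the feature matrix are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(500, 40)          # stand-in for IQA + GLCM + Hu + statistical features
y_train = np.random.randint(0, 100, 500)   # stand-in subject labels (100 classes)

scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(64, 32),   # h1 and h2 neurons
                    activation='logistic',
                    solver='sgd',
                    learning_rate='adaptive',       # adaptive learning rate, as in the chapter
                    learning_rate_init=0.01,
                    max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)
pred = clf.predict(scaler.transform(X_train[:5]))   # identity predictions for probe features
```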
4. Results
Figure 9. Sample images of LR and corresponding super-resolved iris images using the proposed approach
Structural Similarity Index (SSIM)

SSIM(x, y) = [ (2 μx μy + C1) / (μx² + μy² + C1) ] · [ (2 σx σy + C2) / (σx² + σy² + C2) ] · [ (σxy + C3) / (σx σy + C3) ]   (36)

where μx and μy are the means of patches x and y, σx and σy are the standard deviations of patches x and y, and C1, C2 and C3 are positive constants that prevent the statistical instability that may occur when the denominators approach zero. σxy is the cross-correlation of x and y:

σxy = (1/N) Σi (xi − μx)(yi − μy)   (37)

where xi and yi are pixel intensities in image patches x and y, and N is the total number of pixels in the patches.
The SSIM score of the image is obtained by averaging the SSIM values of the patches in the image. The SSIM value ranges from 0 to 1, where higher structural similarity corresponds to better image quality.
Peak Signal-to-Noise Ratio (PSNR)
PSNR is the ratio between the highest possible pixel value in the image and the noise power, expressed in logarithmic decibels (dB):

PSNR = 10 log10( (2ⁿ − 1)² / MSE )   (38)

where (2ⁿ − 1)² is the square of the highest intensity value in the image and n is the number of bits per image sample. The mean square error (MSE) is

MSE = (1 / (M·N)) Σi Σj ( f(i, j) − f'(i, j) )²   (39)

where M is the height and N is the width of the image, f is the original image and f' is the distorted image.
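The two metrics can be computed as in the sketch below, which implements PSNR from Eqs. (38)-(39) directly and uses scikit-image for SSIM; the image pair is simulated.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference, distorted, bits=8):
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    return 10.0 * np.log10(((2 ** bits - 1) ** 2) / mse)

ref = np.random.randint(0, 256, (40, 300), dtype=np.uint8)                  # stand-in HR polar image
test = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)

print('PSNR: %.2f dB' % psnr(ref, test))
print('SSIM: %.3f' % structural_similarity(ref, test, data_range=255))
```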
The proposed system is analyzed by comparing its output with the results of
the state-of-the-art algorithms from Nguyen [43] and Anand [51] for iris polar images
as shown in Figure 10.
Figure 10. Performance analysis of the proposed method for upsampling factor 2: (a) average PSNR (Nguyen 33.59 dB, Anand 36.15 dB, Proposed 37.42 dB); (b) average SSIM (Nguyen 0.873, Anand 0.906, Proposed 0.918); (c) average VIFP (Nguyen 0.827, Anand 0.859, Proposed 0.905).
Figure 10 shows that the proposed system gives better image quality than the state-of-the-art methods. The proposed system is further analyzed for up-sampling factors 2 and 4 on Set3; the obtained resolutions for factors 2 and 4 are 300×40 and 600×80 respectively. The average PSNR, average SSIM and average VIFP for up-sampling factors 2 and 4 are shown in Figures 11 and 12 respectively.
Figure 11. Performance analysis of the proposed method for up-sampling factor 2 on Set3 images: (a) average PSNR (Nguyen 30.29 dB, Anand 32.55 dB, Proposed 34.74 dB); (b) average SSIM (Nguyen 0.859, Anand 0.896, Proposed 0.911); (c) average VIFP (Nguyen 0.793, Anand 0.816, Proposed 0.831).
Figure 12. Performance analysis of the proposed method for up-sampling factor 4 on Set3 images: (a) average PSNR (Nguyen 29.73 dB, Anand 30.18 dB, Proposed 32.63 dB); (b) average SSIM (Nguyen 0.804, Anand 0.835, Proposed 0.894); (c) average VIFP (Nguyen 0.722, Anand 0.793, Proposed 0.811).
Figures 11 and 12 show that, as the upsampling factor increases, the average PSNR, SSIM and VIFP of the proposed method decrease, but the proposed method still gives better image quality than the state-of-the-art methods.
Figure 13. Image quality analysis under salt-and-pepper noise (5% to 25%) for Nguyen, Anand and the proposed method: (a) % of noise vs. PSNR; (b) % of noise vs. SSIM; (c) % of noise vs. VIFP.
Figure 13 shows that the proposed method produces better reconstructions in a noisy environment than the other algorithms.
In this experiment, 100 different persons with 5 images each were used to train the NN classifier. To test the performance of the proposed system, two data sets were used: 1) 50 known persons, each with 2 previously unseen images; and 2) 20 unknown individuals, each with 5 images. The performance of the proposed system is improved by fusing a proper number of images. The Receiver Operating Characteristic (ROC) curves of the proposed system and the state-of-the-art algorithms for Set2 and Set3 images are compared in Figures 14, 15 and 16.
Figure 14. Comparison of the ROC curves (verification rate vs. false accept rate) of the proposed system with the state-of-the-art algorithms for Set2 images: No SR (EER = 1.74%), Nguyen (EER = 0.35%), Anand (EER = 0.21%) and the proposed approach (EER = 0.14%).
Figure 15. Comparison of the ROC curves of the proposed system with the state-of-the-art algorithms for Set3 images, upsampling factor 2: No SR (EER = 3.155%), Nguyen (EER = 1.385%), Anand (EER = 0.505%) and the proposed approach (EER = 0.453%).
Figure 16. Comparison of the ROC curves of the proposed system with the state-of-the-art algorithms for Set3 images, upsampling factor 4: No SR (EER = 6.03%), Nguyen (EER = 2.23%), Anand (EER = 1.06%) and the proposed method (EER = 0.83%).
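For reference, an EER such as those reported in Figures 14-16 can be computed from genuine and impostor match scores as sketched below; the score distributions are simulated and not the chapter's data.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Scores where larger means 'more likely the same iris'."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    frr = np.array([np.mean(genuine < t) for t in thresholds])    # false reject rate
    far = np.array([np.mean(impostor >= t) for t in thresholds])  # false accept rate
    idx = np.argmin(np.abs(far - frr))
    return 0.5 * (far[idx] + frr[idx])

genuine = np.random.normal(0.8, 0.05, 1000)    # simulated genuine comparison scores
impostor = np.random.normal(0.5, 0.08, 5000)   # simulated impostor comparison scores
print('EER = %.2f%%' % (100 * equal_error_rate(genuine, impostor)))
```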
Figures 14, 15 and 16 show that the EER of the proposed approach is lower for both Set2 and Set3 than that of the existing state-of-the-art algorithms, which indicates that iris recognition with the proposed approach performs better than the existing methods. The average time required to super-resolve an iris image is shown in Table 2.
Table 2. Time analysis (average time to super-resolve one iris image)
Image Set    Up-Sampling Factor    Anand    Proposed System
S2           2                     3.21     4.18
S3           2                     2.45     3.53
S3           4                     1.50     3.11
Table 2 shows that the proposed method takes more time than the other method, but the quality of the super-resolved image and the recognition accuracy are far better than those of the other approaches.
5. Conclusion
The aim of this chapter is to recognize iris images captured at a long distance. A novel framework to super-resolve long-range captured iris images based on a deep convolutional neural network has been developed. The analyses show that the proposed framework can super-resolve iris images effectively and gives better recognition performance than state-of-the-art algorithms. In future, this work can be extended to recognize the human iris from a sequence of iris images or video. Further, the work can also be extended to analyze the performance of the system using handcrafted features such as SIFT, SURF, etc.
References
[1] F. Bashir, D. Usher, P. Casaverde, and M. Friedman, "Video surveillance for biometrics: long-range multi-biometric system," 2008 IEEE Fifth International Conference on Advanced Video and Signal Based Surveillance, (2008), pp. 175-182.
[2] C. Fancourt, L. Bogoni, K. Hanna, Y. Guo, R. Wildes, N. Takahashi, and U. Jain, "Iris recognition at a distance," Lecture Notes in Computer Science, vol. 3546, (2005), pp. 1-13.
[3] Subhasis Chaudhuri, "Super Resolution Imaging," Kluwer Academic Publishers, 2002, pp. 1-44.
[4] Sung Cheol Park, Min Kyu Park, and Moon Gi Kang, "Super-Resolution Image Reconstruction: A Technical Overview," IEEE Signal Processing Magazine, May 2003.
[5] Jianchao Yang, Thomas Huang, "Image super resolution: Historical overview and future challenges," www.ifp.illinois.edu/~jyang29/papers/chap1.pdf
[6] J. Allebach and P. W. Wong, "Edge-directed interpolation," IEEE International Conference on Image Processing, (1996), pp. 707-710.
[7] F. N. Fritsch and R. E. Carlson, "Monotone piecewise cubic interpolation," SIAM Journal on Numerical Analysis, vol. 17, no. 2, (1980), pp. 238-246.
[8] J. Sun, J. Sun, Z. Xu, and H. Y. Shum, "Image super-resolution using gradient profile prior," IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[9] W. S. Tam, C. W. Kok, and W. C. Siu, "Modified edge-directed interpolation for images," Journal of Electronic Imaging, vol. 19, no. 1, (2010), pp. 1-20.
[10] L. Wang, et al., "Edge-directed single-image super-resolution via adaptive gradient magnitude self interpolation," IEEE Trans. Circuits Syst. Video Technol., (2013).
[11] M. Bertero, P. Boccacci, "Introduction to Inverse Problems in Imaging," (1998).
[12] K. Zhang, et al., "Single image super-resolution with non-local means and steering kernel regression," IEEE Transactions on Image Processing, 2012.
[13] X. Gao, et al., "Zernike-moment-based image super-resolution," IEEE Transactions on Image Processing, 2011.
[14] Irani, M., Peleg, S., "Improving resolution by image registration," CVGIP Graph. Models Image Process., 53, 1991, pp. 231-239.
[15] Rajan, Subhasis Chaudhuri, "An MRF-Based Approach to Generation of Super-Resolution Images from Blurred Observations," Journal of Mathematical Imaging and Vision, Volume 16, Issue 1, (2002), pp. 5-15.
[16] Manjunath V. Joshi, Subhasis Chaudhuri and Rajkiran Panuganti, "A Learning-Based Method for Image Super-Resolution from Zoomed Observations," IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, Vol. 35, No. 3, (2005).
[17] M. V. Joshi, S. Chaudhuri, and P. Rajkiran, "Super-resolution imaging: use of zoom as a cue," Image and Vision Computing, vol. 14, no. 22, (2004), pp. 1185-1196.
[18] Wen Chen, Xiang-zhong Fang, Yan Cheng, "A maximum a posteriori super resolution algorithm based on multidimensional Lorentzian distribution," Springer Journal of Zhejiang University-SCIENCE, vol. 10, issue 12, (2009), pp. 1705-1713.
[46] Nadia Othman, Nesma Houmani and Bernadette Dorizzi, "Improving Video-based Iris Recognition via Local Quality Weighted Super Resolution," International Conference on Pattern Recognition Applications and Methods, (2013).
[47] Anand Deshpande, Prashant Patavardhan, "Gaussian Process Regression Based Iris Polar Image Super Resolution," IEEE International Conference on Applied and Theoretical Computing and Communication Technology, (2016).
[48] H. He and W.-C. Siu, "Single image super-resolution using Gaussian process regression," IEEE conference proceedings on Pattern Recognition, pp. 449-456, (2011).
[49] Anand Deshpande, Prashant Patavardhan, "Single Frame Super Resolution of Non-cooperative Iris Images," ICTACT Journal on Image and Video Processing, Volume 7, Issue 2, (2016), pp. 1362-1365.
[50] Anand Deshpande, Prashant Patavardhan, "Multi-frame super-resolution for long range captured iris polar image," IET Biometrics, Vol. 6, Issue 2, (2016), pp. 108-116.
[51] Anand Deshpande, Prashant Patavardhan, "Super resolution and recognition of long range captured multi-frame iris images," IET Biometrics, (2017).
[52] U. Park, et al., "Periocular biometrics in the visible spectrum," IEEE Transactions on Information Forensics and Security, (2011).
[53] P. Viola, "Rapid object detection using a boosted cascade of simple features," IEEE Conference on Computer Vision and Pattern Recognition, (2001).
[54] Nitin K. Mahadeo, Andrew P. Paplinski, Sid Ray, "Automated Selection of Optimal Frames in NIR Iris Videos," IEEE International Conference on Digital Image Computing: Techniques and Applications, (2013).
[55] J. Daugman, "How iris recognition works," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 1, pp. 21-30, (2004).
[56] J. Huan, M. Parris, J. Lee, and R. F. DeMara, "Scalable FPGA-based architecture for DCT computation using dynamic partial reconfiguration," ACM Trans. Embedded Comput. Syst., vol. 9, no. 1, pp. 1-18, (2009).
[57] M. Haque, "A 2-D fast cosine transform," IEEE Trans. Acoust. Speech Signal Process., vol. 33, no. 6, pp. 1532-1539, Dec. 1985.
[58] R. M. Haralick, K. Shanmugam and I. Dinstein, "Textural features for Image Classification," IEEE Transactions on Systems, Man and Cybernetics, Vol. 3, pp. 610-621, (1973).
[59] Hu M., "Visual pattern recognition by moment invariants," IRE Transactions on Information Theory, pp. 179-187, (1962).
[60] Christer Carlsson et al., "Fuzzy Entropy Used for Predictive Analytics," IEEE International Conference on Fuzzy Systems, (2015).
[61] J.-D. Shie and S.-M. Chen, "Feature subset selection based on fuzzy entropy measures for handling classification problems," Appl. Intell., vol. 28, no. 1, pp. 69-82, (2008).
[62] J. Mezei, J. A. Morente-Molinera, and C. Carlsson, "Feature selection with fuzzy entropy to find similar cases," Advance Trends in Soft Computing, pp. 383-390, Springer, (2014).
[63] K. Nozaki, H. Ishibuichi, and T. Hideo, "Adaptive fuzzy rule-based classification systems," IEEE Transactions on Fuzzy Systems, vol. 4, no. 3, pp. 238-250, (1996).
[64] Szmidt, E., Kacprzyk, J., "Entropy for intuitionistic fuzzy sets," Fuzzy Sets and Systems, 118, pp. 467-477, (2001).
[65] R. H. Abiyev and K. Altunkaya, "Neural Network Based Biometric Personal Identification with Fast Iris Segmentation," Journal of Control, Automation, and Systems, 2009.
[66] CASIA Iris Image Database [Online]. Available: http://biometrics.idealtest.org/. Last accessed July (2014).
[67] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, (2004), pp. 600-612.
[68] Sheikh, H., Bovik, A., De Veciana, G., "An information fidelity criterion for image quality assessment using natural scene statistics," IEEE Transactions on Image Processing, vol. 14, no. 12, (2005), pp. 2117-2128.
[69] Lukes, T., Fliegel, K., Klima, M., "Performance evaluation of image quality metrics with respect to their use for super-resolution enhancement," Qualcomm Multimedia Exp., (2013), pp. 1-5.
[70] Zhou, X., Bhanu, B., "Evaluating the quality of super-resolved images for face recognition," IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, (2008).
Subject Index
artificial intelligence (AI) 1, 27
automatic disease detection 94
automatic object detection 218
bag of words 137
big data 177
biometrics 244
challenges 177
classification 194
computational intelligence 194
computer vision 244
content analysis and indexing 137
content based image retrieval (CBIR) 68
content-based analysis 164
convolution neural network 50
convolutional neural network with layer skipping (CNNLS) 194
convolutional neural network(s) 27, 137, 164, 244
deep belief network 50
deep data analytics 177
deep face 218
deep learning 27, 50, 137, 177, 194, 218, 244
deep learning convolutional neural network 94
deep multi-channel fusion 218
differential evolution extreme learning machines 111
face recognition 218
feature detection 94
feature representation 137, 164
feature selection 68
gray level co-occurrence matrix 111
hand gesture recognition 50
hierarchical object detection 164
image classification 27
image fusion 218
image indexing 164
image processing (IP) 1, 244
image processing and computer vision 137
information storage and retrieval 137, 164
iris recognition 244
kinetic 50
machine 1
machine learning 27
mind 1
neural networks 27
principal component analysis 111
radial basis function 94
reinforcement learning 164
relevance feedback 68
restricted boltzmann machine 50
security 218
semantic gap 68
sketch-based image retrieval (SBIR) 68
SoftMax function 94
sparse filtering 194
spatio-temporal feature 50
sputum smear images 111
stevia 94
super resolution 244
support vector machine 111
surveillance 218
tuberculosis 111
vehicle type recognition 194
virtual reality 50
Author Index
Agrawal, S. 68
Awang, S. 194
Bellver Bueno, M. 164
Bhatia, M. 177
Deshpande, A. 244
Estrela, V.V. v
Giró-i-Nieto, X. 137, 164
Govindaraju, S. 27
Hemanth, D.J. v
Henesey, L.E. 50
Ho, C.C. 50
Jain, S. 68
John, V. 218
Kumar Singh, R. 68
Lakshmi, S. 94
Liu, S. 218
Liu, Z. 218
Madhurima 177
Marqués, F. 137, 164
Martin Sagayam, K. 50
McGuinness, K. 137
Mittal, M. 177
Mohedano, E. 137
Nik Azmi, N.M.A. 194
O'Connor, N.E. 137
Pandey, S.C. 1
Patavardhan, P.P. 244
Pratap Singh, U. 68
Priya, E. 111
Salvador, A. 137
Sivakumar, R. 94
Srinivasan, S. 111
Torres, J. 164
Vasuki, A. 27
Vedha Viyas, T. 50