Human Action Recognition Report

ABSTRACT

A facial recognition system is a technology capable of matching a human face in a digital image or video frame against a database of faces. Automatic facial recognition is a widely used task in computer vision; it is very easy for a human but very challenging for a computer.
Facial expression recognition detects emotions such as happiness, sadness, surprise, anger, fear, crying, disgust, laughing and yawning. Detecting human emotion from images is one of the most powerful and challenging research tasks in social communication. Deep learning (DL) based emotion detection performs better than traditional image-processing methods.
This report presents the design of an artificial intelligence (AI) system capable of emotion detection through facial expressions. It discusses the procedure of emotion detection, which consists of three main steps: face detection, feature extraction, and emotion classification. A convolutional neural network (CNN) based deep learning architecture is proposed for emotion detection from images.
Based on the detected facial emotion, the person's walking activity is also recognized.

CHAPTER 1
INTRODUCTION
Recently, automation systems have drawn much attention in both industrial and academic research, and the Human Activity Recognition (HAR) system is one such topic in the field of computer vision analysis. Demand comes from numerous applications such as medical and healthcare systems, security, visual monitoring, video acquisition and entertainment, as well as abnormal activity monitoring to capture glimpses of what is going on and to detect illegal or potentially harmful practices. In the area of entertainment, HAR also helps to improve the performance of Human Computer Interaction (HCI). Furthermore, HAR plays a vital role in healthcare systems by recognizing the actions and behaviour of patients undergoing rehabilitation, thereby facilitating the rehabilitation process. Many scholars have applied HAR methods in their studies, especially for abnormal activities at home, general human activity, sports and street activity, healthcare monitoring and many more applications. In this report, computer vision-based technologies for recognizing human activities or abnormal behaviours using computationally intelligent classification techniques such as deep learning and machine learning are reviewed and discussed, along with challenges and future possibilities. An action in the HAR mechanism may be witnessed by the human eye or by some kind of visualization or sensing technology. The actual activity of the individual in the field of view must be constantly monitored for the operation to be performed properly. Based on the type of body parts involved in the motion, human actions may be divided into four categories:
• Facial and walking: focused on the action of a human's face, or of other body parts when walking, with no requirement of verbal contact.
• Action: a series of gestures performed by a human, such as running, sitting or walking.
• Interaction: an important aspect that incorporates individual actions to be executed by humans; an interaction can be with another individual or a single person.
• Group Activity: a mix of human movements, behaviours, acts or interactions. Several performers can act at a time, but at least two objects or persons are needed for an interaction.


HAR has become very useful in current applications, including complex content-based video processing and video search, visual monitoring studies, search interfaces, schooling, and health care. Further discussion of these applications of the HAR system is provided later in this report; here we give an overview of the HAR system. Generally, in a HAR system the action recognition task is expressed through action interpretation and action representation of the human performer [5]. Different types of equipment, such as cameras, sensors such as an RGB detector, or sensors worn on the body, may help acquire these activities.

INTRODUCTION TO ARTIFICIAL INTELLIGENCE


Artificial Intelligence (AI) is a branch of science which deals with helping machines find solutions to complex problems in a more human-like fashion.
• This generally involves borrowing characteristics from human intelligence and applying them as algorithms in a computer-friendly way.
• A more or less flexible or efficient approach can be taken depending on the requirements established, which influences how artificial the intelligent behavior appears.

Figure 1.1: View of Artificial Intelligence

Importance of AI
• Game Playing: You can buy machines that can play master-level chess for a few hundred dollars. There is some AI in them, but they play well against people mainly through brute-force computation, looking at hundreds of thousands of positions. To beat a world champion by brute force and known reliable heuristics requires being able to look at 200 million positions per second.
• Speech Recognition: In the 1990s, computer speech recognition reached a practical level for limited purposes. Thus United Airlines has replaced its keyboard tree for flight information with a system using speech recognition of flight numbers and city names. It is quite convenient. On the other hand, while it is possible to instruct some computers using speech, most users have gone back to the keyboard and the mouse as still more convenient.
• Understanding Natural Language: Just getting a sequence of words into a computer is not enough. Parsing sentences is not enough either. The computer has to be provided with an understanding of the domain the text is about, and this is presently possible only for very limited domains.
• Computer Vision: The world is composed of three-dimensional objects, but the inputs to the human eye and computers' TV cameras are two-dimensional. Some useful programs can work solely in two dimensions, but full computer vision requires partial three-dimensional information that is not just a set of two-dimensional views. At present there are only limited ways of representing three-dimensional information directly, and they are not as good as what humans evidently use.
PROBLEM DEFINITION

Many researchers have contributed innovative algorithms and approaches in the area of human action recognition and have conducted experiments on individual data sets, considering accuracy and computation. In spite of these efforts, the field still requires high accuracy with low computational complexity. The existing techniques are inadequate in accuracy due to assumptions regarding clothing style, view angle and environment. Hence, the main objective of this project is to develop an efficient multi-view human action recognition system using shape features. During the development phase, the following two objectives have been conceived for the proposed approach:
• Primary objective: to develop an efficient human action recognition system using multiple views.
• Secondary objective: to understand the human behavior model using a probabilistic action graph.


SCOPE OF THE PROJECT


Human action recognition aims to identify the activities carried out by an individual, typically the actions performed frequently in everyday life with respect to the surrounding environment.
Recognition can be accomplished by exploiting information retrieved from various sources, for example the environment, by using sensors. In this project we collect a dataset from 30 volunteers ranging from 19 to 48 years of age and process the information using the algorithm. Human action recognition plays a noteworthy role in human-to-human communication and interpersonal relations.
MOTIVATION
Understanding human activity and interaction with surrounding objects is a key element for the development of intelligent systems. Human action recognition is a field that deals with the problems generated by integrating sensing and reasoning in order to provide context-aware data that can confer personalized support across an application. In human action recognition systems there are still various issues that need to be addressed, such as the battery limitation of wearable sensors, privacy concerns regarding continuous monitoring of activities, the difficulty of performing HAR (Human Activity Recognition) in real time, and the lack of fully ambient systems able to reach users at any time.

OBJECTIVE
Detecting human beings accurately in a visual surveillance system is crucial for diverse application areas including abnormal event detection, human gait characterization, congestion analysis, person identification, gender classification and fall detection for elderly people. The biggest use of such a system is during the ongoing worldwide Covid-19 pandemic, where intelligent tracking of mass gatherings is of utmost importance to avoid community spread of the disease. As the system comes with real-time human counting, the governing body's task of identifying crowded places or streets is reduced significantly with the help of the proposed system.


ORGANIZATION OF THE REPORT


Chapter 1: This chapter describes, in brief, the idea of the project. It begins with an explanation of the purpose of the project, the definitions of a few terms used in the document, the problem definition, the motivation and the scope of the project.

Chapter 2: This chapter describes the literature survey and the background preparation done to understand more about this project.

Chapter 3: This chapter describes the system requirements, such as hardware and software requirements, functional and non-functional requirements, the preliminary investigation and the system environment.

Chapter 4: This chapter describes the existing system and its limitations, and how this project tries to improve on the existing system. It explains the proposed system and its architecture.

Finally, this report explains the problem at hand for the system that is being designed.

CHAPTER 2
LITERATURE SURVEY

INTRODUCTION

A literature survey, or literature review, is the section of a project report that presents the various analyses and research carried out in the field of interest and the results already published, taking into account the various parameters and the extent of the project. It is one of the most important parts of the report, as it gives a direction to the area of research and helps set a goal for the analysis, thus leading to the problem statement.
A literature survey is largely carried out in order to analyse the background of the current project, discover imperfections in the existing systems and identify which unsolved issues can be worked on. In this way, the following references describe the foundation of the project, reveal the open issues and encourage proposing solutions to the current problems. A variety of research has been carried out in this area; the following section explores different references that discuss several topics related to this project.

RELATED WORK

Reference [1]:
Abstract: Human action recognition is an important research area in the field of computer vision due to its numerous applications such as person surveillance, human-to-object interaction, etc. The human action recognition approach is based on a pre-trained CNN model for feature extraction. A convolutional neural network (CNN) is a deep learning technique. Most convolutional neural networks used for recognition tasks are built using convolution and pooling layers followed by a small number of fully connected layers, identifying similar patterns over an interval to recognize the action, with an accuracy of 79-90% depending on the task.


Reference [2]:
Abstract: Nowadays the deluge of data is increasing, with new technologies coming up daily. These advancements in recent times have also led to increased growth in fields like Robotics and the Internet of Things (IoT). This paper helps us draw a comparison between the usage and accuracy of different Human Activity Recognition models. The discussion covers mainly two models: the 2-D Convolutional Neural Network and Long Short-Term Memory. In order to maintain the consistency and credibility of the survey, both models are
trained using the same dataset containing information collected using wearable sensors
which was acquired from a public website. They are compared using their accuracy and
confusion matrix to check the true and false positives and later the various aspects and
fields, where the two models can separately and together be used in the wider field of
Human Activity Recognition using image data have been explained. The experimental
results signified that both Convolutional Neural Networks and Long-Short term memory
model are equally equipped for different situations, yet Long-Short Term memory model
mostly appears to be more consistent than Convolutional Neural Networks.

Reference [3]:
Abstract: Body activity recognition using wearable sensor technology has drawn more and more attention over the past few decades. The complexity and variety of body activities make it difficult to recognize body activities quickly, accurately and automatically. To solve this problem, this paper formulates the body activity recognition problem as a classification problem using data collected by wearable sensors, and three different machine learning algorithms, support vector machine, hidden Markov model and artificial neural network, are presented to recognize different body activities. Various numerical experiments on a real-world wearable sensor dataset are designed to verify the effectiveness of these classification algorithms. Finally, the results demonstrate that all three algorithms achieve satisfactory activity recognition performance.


Reference [4]:
Abstract: Near infrared-visible (NIR-VIS) heterogeneous face recognition refers to the
process of matching NIR to VIS face images. Current heterogeneous methods try to extend
VIS face recognition methods to the NIR spectrum by synthesizing VIS images from NIR
images. However, due to the self-occlusion and sensing gap, NIR face images lose some
visible lighting contents so that they are always incomplete compared to VIS face images.
This paper models high-resolution heterogeneous face synthesis as a complementary
combination of two components: a texture inpainting component and a pose correction
component. The inpainting component synthesizes and inpaints VIS image textures from
NIR image textures. The correction component maps any pose in NIR images to a frontal
pose in VIS images, resulting in paired NIR and VIS textures. A warping procedure is
developed to integrate the two components into an end-to-end deep network. A fine-
grained discriminator and a wavelet-based discriminator are designed to improve visual
quality. A novel 3D-based pose correction loss, two adversarial losses, and a pixel loss are
imposed to ensure synthesis results. We demonstrate that by attaching the correction
component, we can simplify heterogeneous face synthesis from one-to-many unpaired
image translation to one-to-one paired image translation, and minimize the spectral and
pose discrepancy during heterogeneous recognition. Extensive experimental results show
that our network not only generates high-resolution VIS face images but also facilitates the
accuracy improvement of heterogeneous face recognition.

Reference [5]:
Abstract: The way people look in terms of facial attributes (ethnicity, hair color, facial
hair, etc.) and the clothes or accessories they wear (sunglasses, hat, hoodies, etc.) is highly
dependent on geo-location and weather condition, respectively. This work explores, for the
first time, the use of this contextual information, as people with wearable cameras walk
across different neighborhoods of a city, in order to learn a rich feature representation for
facial attribute classification, without the costly manual annotation required by previous
methods. By tracking the faces of casual walkers on more than 40 hours of egocentric
video, we are able to cover tens of thousands of different identities and automatically


extract nearly 5 million pairs of images connected by or from different face tracks, along
with their weather and location context, under pose and lighting variations. These image
pairs are then fed into a deep network that preserves similarity of images connected by the
same track, in order to capture identity-related attribute features, and optimizes for location
and weather prediction to capture additional facial attribute features. Finally, the network is
fine-tuned with manually annotated samples. We perform an extensive experimental
analysis on wearable data and two standard benchmark datasets based on web images
(LFWA and CelebA). Our method outperforms by a large margin a network trained from
scratch. Moreover, even without using manually annotated identity labels for pre-training
as in previous methods, our approach achieves results that are better than the state of the
art.

Reference [6]:
Abstract: Automatic facial gender recognition is a widely used task in the field of computer
vision, which is very easy for a human, but very challenging for computers. In this paper, a
face gender classification algorithm based on face recognition feature vectors is proposed.
Firstly, face detection and preprocessing are performed on the input images, and the faces are
adjusted to a unified format. Secondly, the face recognition model is used to extract feature
vectors as the representation of the face in the feature space. Finally, machine learning
methods are used to classify the extracted feature vector. Meanwhile, this study uses t-
distributed Stochastic Neighbor Embedding (T-SNE) to visualize the face recognition feature
vectors to verify the effectiveness of the face recognition feature vectors on the issue of
gender classification. The proposed method has achieved a recognition rate of 99.2% and
98.7% on the FEI dataset and the SCIEN dataset, respectively. Besides, it also achieves a
recognition rate of 97.4% on the Asian star face dataset, outperforming existing methods, which shows that the proposed method is helpful for facial gender recognition research.


Reference [7]:
Abstract: Surveillance video systems are gaining increasing attention in the field of computer vision due to users' demand for security. It is promising to observe human movement and predict such movements. The need arises to develop a surveillance system capable of overcoming the shortcomings of depending on human operators to keep monitoring and observing normal and suspicious events all the time without lapses of attention, and to facilitate the control of huge surveillance system networks. In this paper, an intelligent human activity recognition system is developed. A series of digital image processing techniques is used in each stage of the proposed system, such as background subtraction, binarization, and morphological operations. A robust neural network was built based on a database of human activity features, which was extracted from the frame sequences. A multi-layer feed-forward perceptron network is used to classify the activity models in the dataset. The classification results show high performance in all of the stages of training, testing and validation. Finally, these results lead to a promising performance in the activity recognition rate.

Reference [8]:
Abstract: Human activity recognition exhibits its presence in diverse research areas like medical organizations, survey systems, security surveillance as well as human-computer interaction. This work demonstrates a robust approach to classifying six basic human-centered behaviors (Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, and Lying) implementing Logistic Regression, Logistic Regression CV and the Random Forest algorithm. This is an up-and-coming research direction to comprehend individual actions and assimilate their social context. Sensing human body motion with smartphones to collect context information is a precise, demanding and agreeable application. Here, a publicly available activity recognition database is used as the repository.


Reference [9]:
Abstract: Human activity recognition (HAR) is a hot research topic which aims to understand human behavior and can be applied in various applications. However, transitions between activities are usually disregarded due to their low incidence and short duration when compared against other activities, while in fact transitions can affect the performance of the recognition system if not dealt with properly. In this paper, we propose and implement a systematic human activity recognition method to recognize basic activities (BA) and transitional activities (TA) in a continuous sensor data stream. Firstly, raw sensor data are segmented into fragments with a sliding window and the features are constructed based on the window segmentation. Secondly, cluster analysis with K-Means is used to aggregate activity fragments into periods; the classification of BA and TA is generally realized according to the shortest duration of the BA, and hidden BA are then dealt with. Thirdly, the fragments between adjacent BA are evaluated to decide whether they are TA or a disturbance process. Finally, a Random Forest classifier is used to accurately recognize BA and TA. The proposed method is evaluated on the public dataset SBHAR. The results demonstrate that our method effectively recognizes different activities and can deliver high accuracy with all activities considered.

Reference[10]:
Abstract: For the past few years, smartphone based human activity recognition (HAR) has
gained much popularity due to its embedded sensors which have found various applications
in healthcare, surveillance, human-device interaction, pattern recognition etc. In this paper,
we propose a neural network model to classify human activities, which uses activity-driven
hand-crafted features. First, the neighborhood component analysis derived feature selection
is used to choose a subset of important features from the available time and frequency
domain parameters. Next, a dense neural network consisting of four hidden layers is
modeled to classify the input features into different categories. The model is evaluated on
publicly available UCI HAR data set consisting of six daily activities; our approach
achieved 95.79% classification accuracy. When compared with existing state-of-the-art
methods, our proposed model outperformed most other methods while using fewer features,
thus showing the importance of proper feature selection.

CHAPTER 3
SYSTEM REQUIREMENTS SPECIFICATION
A System Requirements Specification (SRS) is a collection of information that embodies the requirements of the system. A Business Analyst (BA), sometimes titled a system analyst, is responsible for analyzing the business needs of clients and stakeholders to help identify business problems and propose solutions. Within the system development life cycle domain, the BA typically performs a liaison function between the business side of an enterprise and the information technology department or external service providers.

INTRODUCTION
A System Requirements Specification is a set of documentation that describes the features and behavior of a system or software application. It includes a variety of elements that attempt to define the intended functionality required by the customer. In addition to specifying how the system should behave, the specification also defines, at a high level, the main business processes that will be supported, what simplifying assumptions have been made and what key performance parameters will need to be met by the system. Depending on the methodology employed, the level of formality and detail in the SRS will vary, but in general an SRS should include a description of the functional requirements, non-functional requirements, software requirements and hardware requirements. The requirements for a system are descriptions of the services provided by the system and its operational constraints. These requirements reflect the needs of customers for a system that helps solve some problem such as controlling a device, placing an order or finding information. A requirement is simply a high-level, abstract statement of a service that the system should provide or a constraint on the system.

Functional requirements
• The system should automatically detect human actions.
• The system should automatically detect humans in videos.
• If the system has difficulty recognizing a person's face, the event is treated as unusual activity.
• The system should automatically detect unusual events such as a human fall.


Non-functional requirements

Usability:
• Easy interface for image capture; alert the control room if any unusual activity occurs.
Reliability:
• Theft avoidance.
Performance:
• Should not take excessive time to detect an anomaly and take measures.
Supportability:
• Easy-to-understand code with provisions for future enhancement.

Hardware requirements
• System : Intel i3/i5, 2.4 GHz
• Hard disk : 500 GB
• RAM : 4/8 GB

Software requirements
• Operating system : Windows XP / Windows 7
• Software tool : OpenCV-Python
• Coding language : Python

Windows 10: Windows 10 is a series of operating systems developed by Microsoft and


released as part of its Windows NT family of operating systems. It is the successor to
Windows 8.1, released nearly two years earlier, and was released to manufacturing on July
15, 2015, and broadly released for the general public on July 29, 2015.[14] Windows 10 was
made available for download via MSDN and Technet, and as a free upgrade for retail copies
of Windows 8 and Windows RT via the Windows Store. Windows 10 receives new builds on an ongoing basis, which are available at no additional cost to users, as well as test builds of Windows 10, which are available to Windows Insiders. Devices in
enterprise environments can receive these updates at a slower pace, or use long-term support
milestones that only receive critical updates, such as security patches, over their ten-year
lifespan of extended support.


Python: Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC programming language, which was inspired by SETL, capable of exception handling and interfacing with the Amoeba operating system.[10] Its implementation began in December 1989. Van Rossum shouldered sole responsibility for the project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation" from his responsibilities as Python's Benevolent Dictator For Life, a title the Python community bestowed upon him to reflect his long-term commitment as the project's chief decision-maker. In January 2019, active Python core developers elected a 5-member "Steering Council" to lead the project.[43] As of 2021, the current members of this council are Barry Warsaw, Brett Cannon, Carol Willing, Thomas Wouters, and Pablo Galindo Salgado.

OpenCV-Python: OpenCV-Python is a library of Python bindings designed to solve computer vision problems. The cv2.imread() method loads an image from the specified file. If the image cannot be read (because of a missing file, improper permissions, or an unsupported or invalid format), this method returns an empty result (None in Python).
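A minimal sketch of this behavior; the file name used here is only a placeholder.

import cv2

# cv2.imread() does not raise an exception on failure; it simply returns None
# when the file cannot be read, so the result must be checked explicitly.
img = cv2.imread("sample_face.jpg", cv2.IMREAD_COLOR)  # hypothetical file name

if img is None:
    print("Image could not be read: check the path, permissions and format.")
else:
    print("Loaded image with shape (rows, cols, channels):", img.shape)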

PRELIMINARY INVESTIGATION
 “Human activity recognition using openCV.”, International journal of creative
research thoughts(IJCRT2021).
Human Action Recognition is an area computer vision research and Applications. The
goal of Human Action Recognition is to identify and understand the actions of people
in videos and export corresponding tags which can be achieved through automated
analysis or interpretation of ongoing events and their text using video data input. The
human activity recognition model was trained on Kinetic dataset which contains 400
actions. Though the retracing of Spatiotemporal 3D CNN the actions are recognized.

 “Face Gender Recognition based on Face Recognition Feature Vectors”,


International Conference on Information Systems and Computer Aided Education
(ICISCAE2020).
Firstly, face detection and preprocessing are performed on the input images, and the
faces are adjusted to a unified format. Secondly, the face recognition model is used to
extract feature vectors as the representation of the face in the feature space. Finally,

14 2021-2022
Facial and walking recognition using openCV System requriments specification

machine learning methods are used to classify the extracted feature vector. The proposed
method has achieved a recognition rate of 99.2% and 98.7% on the FEI dataset and the
SCIEN dataset, respectively. Besides, it also achieves a recognition rate of 97.4% on the
Asian star face dataset.

 “Smartphone Based Human Activity Recognition with Feature Selection and Dense
Neural Network”, International Conference on Reliability, Infocom Technologies
and Optimization(ICRITO2020)

The model is evaluated on publicly available UCI HAR data set consisting of six daily
activities; our approach achieved 95.79% classification accuracy. The goal of HAR is to
identify various postural positions and daily activities carried out by a person given a set
of observations and the surrounding environment using Convolutional neural network
(CNN).

SYSTEM ENVIRONMENT
OpenCV has C++, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. OpenCV leans mostly towards real-time vision applications and takes advantage of MMX and SSE instructions when available. Full-featured CUDA and OpenCL interfaces are being actively developed. There are over 500 algorithms and about 10 times as many functions that compose or support those algorithms. OpenCV is written natively in C++ and has a templated interface that works seamlessly with STL containers.
In 1999, the OpenCV project was initially an Intel Research initiative to advance CPU-intensive applications, part of a series of projects including real-time ray tracing and 3D display walls. The main contributors to the project included a number of optimization experts in Intel Russia, as well as Intel's Performance Library Team.
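As a quick way to confirm which of these capabilities a local OpenCV installation actually includes, the version and build configuration can be inspected; this is only a small sketch and the output varies by installation.

import cv2

# Print the installed OpenCV version and its build configuration, which lists
# the enabled modules and optimizations (CUDA, OpenCL, SSE and so on).
print("OpenCV version:", cv2.__version__)
print(cv2.getBuildInformation())

# OpenCL availability can also be queried directly.
print("OpenCL available:", cv2.ocl.haveOpenCL())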

CHAPTER 4
SYSTEM ANALYSIS
System Analysis as “the process of studying a procedure or business in order to
identify its goals and purpose and create system and procedures that will achieve them in an
efficient way”. Another view sees system analysis as a problem-solving technique that
breaks down a system into its component pieces for the purpose of the studying how well
those component parts work and interact to accomplish their purpose.

EXISTING SYSTEM
The most successful and popular vector-form feature is the histogram of oriented gradients (HOG). It has been shown that HOG features are based on the contrast of silhouette contours against the background. Despite all the difficulties of human detection, a lot of work has been done in recent years. First, different features may be used, such as edges, Haar features and gradient orientation features; second, different classifiers may be used, such as Nearest Neighbor, Neural Network, SVM and AdaBoost. Designing the classifier is the second step of human detection. Large generalization ability and low classification complexity are two important criteria for selecting classifiers. The linear support vector machine (SVM) and AdaBoost are two widely used classifiers satisfying these criteria. The traditional AdaBoost approach for face detection has demonstrated both high recognition accuracy and fast run-time performance. However, in most cases its classification accuracy is lower than that of the first proposed algorithm based on HOG + SVM.
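OpenCV ships a pedestrian detector that follows exactly this HOG + linear SVM approach; the sketch below shows how it is typically invoked (the image file name is a placeholder).

import cv2

# HOG descriptor with OpenCV's pre-trained linear SVM for pedestrian detection.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("street_scene.jpg")          # placeholder input image
rects, weights = hog.detectMultiScale(frame,
                                      winStride=(8, 8),
                                      padding=(8, 8),
                                      scale=1.05)

# Draw a bounding box around every detected person.
for (x, y, w, h) in rects:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow("HOG + SVM detections", frame)
cv2.waitKey(0)
cv2.destroyAllWindows()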

PROPOSED SYSTEM
The methodology of Human Activity Recognition includes several processing steps: taking the input, identifying similar patterns, comparing the frames with the Kinetics dataset, recognizing the actions, and providing the caption and speech of the action for the video frames.

Capturing the frames:

The human actions performed in the input video are divided into frames at certain intervals of time. These frames are captured and given as input to the CNN model, which identifies similar patterns by pooling them into certain classes of actions.
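A minimal sketch of this frame-sampling step using OpenCV; the video file name and the sampling interval are assumptions for illustration.

import cv2

cap = cv2.VideoCapture("input_video.mp4")   # placeholder video file
fps = cap.get(cv2.CAP_PROP_FPS) or 25       # fall back if FPS is unavailable
interval = int(fps)                         # sample roughly one frame per second

frames = []
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break                               # end of video
    if index % interval == 0:
        frames.append(frame)                # frame passed on to the CNN model
    index += 1

cap.release()
print("Captured", len(frames), "frames for recognition")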


Dataset:
The Kinetics dataset, which consists of 400 human activities, is used for prediction and comparison with the input data. The Kinetics dataset is taken from YouTube recordings. The activities are human-focused and cover a wide range of classes, including human-object interactions such as mowing the lawn or washing dishes, as well as general human actions. Since the dataset is huge, downloading each clip would be a waste of time given that pre-trained models by the original author are already available. Working with the pre-trained model is easier and provides accurate results compared to training and tuning it separately.

Recognition of the action:

The Kinetics dataset is used by the ResNet-34 3D model to compare similar patterns in the input data frames that are captured at intervals. The similar patterns are identified by the CNN through pooling, layer by layer. The identified actions are categorized into classes of human activities. Recognition of the input data is done by the ResNet-34 model through video classification with 3D kernels. The segmented actions are the classes identified by the model.

Context and Speech output of the action:

Through a Python program, the caption of the activity identified by the model is displayed on the video while the input file is played. Simultaneously, speech for the captioned activity is produced.
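A hedged sketch of how the recognized label might be overlaid on the video and spoken aloud. The pyttsx3 text-to-speech library, the label value and the file name are assumptions for illustration, not part of the original implementation.

import cv2
import pyttsx3   # assumed text-to-speech library; the report only states that speech is produced

label = "walking"                            # hypothetical action label from the recognition model
engine = pyttsx3.init()

cap = cv2.VideoCapture("input_video.mp4")    # placeholder input file
spoken = False
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Caption the current frame with the recognized activity.
    cv2.putText(frame, label, (20, 40), cv2.FONT_HERSHEY_SIMPLEX,
                1.0, (0, 255, 0), 2)
    cv2.imshow("Activity", frame)
    if not spoken:
        engine.say(label)                    # speak the caption once
        engine.runAndWait()
        spoken = True
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()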

CHAPTER 5
SYSTEM DESIGN

System design is the process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. Systems design can be seen as the application of systems theory to product development. Object-oriented analysis and design methods are becoming the most widely used methods for computer systems design. Systems design is therefore the process of defining and developing systems to satisfy the specified requirements of the user. The UML has become the standard language in object-oriented analysis and design.

5.1. ARCHITECTURAL DESIGN

System architecture is a conceptual model that defines the structure and behavior of the system. It comprises the system components and the relationships describing how they work together to implement the overall system.
In this project, running the main web page triggers an XML file; OpenCV then helps in capturing images from the webcam as well as in processing them. The Fisherface method of OpenCV is used for classification: Fisherfaces are used to train the model, which is stored in a model file (XML). While using the player, the model is used to predict the emotion, and the main media player web page is shown. It contains two options: one for emotion-based detection and the other for random selection of songs. For random picking, a small Python library, Eel, is used. The emotion-based music system, on the other hand, uses three main algorithms for capturing, detection and playing of the music. The system describes facial expressions using detection and a combination of spatial features. After feature extraction, the emotions are classified into four forms, i.e. happy, angry, sad and neutral. The emotions transferred to the last step are in numerical form, and the music is played according to the detected emotion. The main objective of the face detection technique is to identify the frame, i.e. the face. The other phase of the project is the random mode, for which Eel is used for random picking of songs irrespective of the queue. The winsound module is used to access the local sound-playing machinery provided on Windows platforms.
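A minimal sketch of the Fisherface training-and-prediction flow described above. It assumes the opencv-contrib-python package (which provides the cv2.face module), and the training images, labels and file name below are placeholders rather than the project's actual data.

import cv2
import numpy as np

# Fisherface recognizer from the cv2.face module (requires opencv-contrib-python).
model = cv2.face.FisherFaceRecognizer_create()

# Placeholder training data: equally sized grayscale face crops with emotion labels
# (e.g. 0=happy, 1=angry, 2=sad, 3=neutral). Real face images would replace these.
faces = [np.random.randint(0, 256, (100, 100), dtype=np.uint8) for _ in range(8)]
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3], dtype=np.int32)

model.train(faces, labels)
model.write("emotion_model.xml")     # store the trained model in an XML file

# Later, predict the emotion of a new (preprocessed, 100x100 grayscale) face.
test_face = np.random.randint(0, 256, (100, 100), dtype=np.uint8)
emotion_id, confidence = model.predict(test_face)
print("Predicted emotion id:", emotion_id, "confidence:", confidence)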
Data Flow diagram:
A dataflow diagram is a graphical representation of the "flow" of data through an information system,
modeling its process aspects. A DFD is often used as a preliminary step to create an overview of the
system without going into great detail, which can later be elaborated. DFDs can also be used for the
visualization of data processing. A DFD shows what kind of information will be input to and output
from the system, how the data will advance through the system, and where the data will be stored.
Sequence Diagram:
A sequence diagram shows object interactions arranged in time sequence. It depicts the objects and
classes involved in the scenario and the sequence of messages exchanged between the objects needed
to carry out the functionality of the scenario. Sequence diagrams are typically associated with use case
realizations in the Logical View of the system under development.
Sequence diagrams are sometimes called event diagrams or event scenarios. A sequence diagram
shows, as parallel vertical lines (lifelines), different processes or objects that live simultaneously, and,
as horizontal arrows, the messages exchanged between them, in the order in which they occur. This
allows the specification of simple runtime scenarios in a graphical manner.
Use case Diagram:
A use case diagram at its simplest is a representation of a user's interaction with the system that shows
the relationship between the user and the different use cases in which the user is involved. A use case
diagram can identify the different types of users of a system and the different use cases and will often
be accompanied by other types of diagrams as well. While a use case itself might drill into a lot of
detail about every possibility, a use case diagram can help provide a higher-level view of the system. It
has been said before that "Use case diagrams are the blueprints for your system". They provide the
simplified and graphical representation of what the system must actually do.
Class Diagram:
The class diagram is the main building block of object-oriented modeling. It is used for general
conceptual modeling of the structure of the application, and for detailed modeling translating the
models into programming code. Class diagrams can also be used for data modeling.
CHAPTER 6

IMPLEMENTATION

1. Capture the video: Depending on the complexity of the activity recognition and the resource availability, a camera with the appropriate resolution is used to capture the video for further processing.
2. Segmentation of the video, where the region of interest or presence of humans is detected.
3. Feature extraction, where the required features are extracted based on the motion or the pose of the humans.
4. Feature representation, where the extracted features are represented using feature vectors or feature descriptors.
5. Finally, training and testing are done using a classification model (a condensed sketch of these steps follows the list).
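The sketch below condenses steps 1-5 under simplifying assumptions: background subtraction for segmentation, HOG features for representation, and a linear SVM from scikit-learn as the classifier. The variable names, dataset and labels are placeholders, not the project's actual pipeline.

import cv2
import numpy as np
from sklearn.svm import LinearSVC

hog = cv2.HOGDescriptor()                          # steps 3-4: feature extraction / representation
subtractor = cv2.createBackgroundSubtractorMOG2()  # step 2: foreground (human) segmentation

def frame_to_feature(frame):
    """Segment the moving region and describe it with a HOG feature vector."""
    mask = subtractor.apply(frame)
    roi = cv2.bitwise_and(frame, frame, mask=mask)
    roi = cv2.resize(roi, (64, 128))               # HOG default detection window size
    return hog.compute(roi).flatten()

def train_classifier(frames, labels):
    """Step 5: fit a linear SVM on per-frame features and their activity labels."""
    X = np.array([frame_to_feature(f) for f in frames])
    return LinearSVC().fit(X, np.array(labels))

# Usage (placeholders): `frames` come from step 1 (video capture) and `labels`
# from the annotated training set, e.g.
#   clf = train_classifier(frames, labels)
#   prediction = clf.predict([frame_to_feature(test_frame)])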

Algorithm Details:

A Deep-CNN is a type of DNN consisting of multiple hidden layers such as the convolutional layer, ReLU layer, pooling layer, and fully connected and normalized layers. A CNN shares weights in the convolutional layer, reducing the memory footprint and increasing the performance of the network. The important features of a CNN lie in the 3D volumes of neurons, local connectivity and shared weights. A feature map is produced by the convolution layer through convolution of different sub-regions of the input image with a learned kernel. Then, a non-linear activation function is applied through the ReLU layer to improve the convergence properties when the error is low. In the pooling layer, a region of the image/feature map is chosen and the pixel with the maximum value among them, or the average value, is chosen as the representative.
Fig 4.1 : Deep-Convolutional Neural Network Architecture

This results in a large reduction in the sample size. Sometimes, a traditional Fully-Connected (FC) layer is used in conjunction with the convolutional layers towards the output stage. In a CNN architecture, the convolution layer and pooling layer are usually used in some combination. The pooling layer usually carries out two types of operations, viz. max pooling and mean pooling. In mean pooling, the average over the neighborhood of feature points is calculated, and in max pooling the maximum of the feature points is taken. Mean pooling reduces the error caused by the neighborhood size limitation and retains background information. Max pooling reduces the convolution layer parameter estimation error caused by the mean deviation and hence retains more texture information.

The proposed method follows these stages:

Data Set:

Dataset for training is obtained from Lung Image Database Consortium (LIDC) and Image
Database Resource Initiative (IDRI). LIDC and IDRI consist of 1000 CT scans of both large
and small tumors saved in Digital Imaging and Communications in Medicine (DICOM)
format.

Image Segmentation:

Image segmentation is the phase where the visual image is partitioned into several parts. This normally helps to identify objects and boundaries. The aim of segmentation is to simplify the representation of a picture into a concrete form that can be clearly interpreted and quickly analyzed.

Pre-Processing:

In the preprocessing stage, the median filter is used to restore the image under test by minimizing the effects of degradations that occur during acquisition. Various preprocessing and segmentation techniques for lung nodules are discussed in the literature. The median filter simply replaces each pixel value with the median value of its neighbors, including itself. Hence, pixel values which are very different from their neighbors will be eliminated.
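A small sketch of this median filtering step with OpenCV; the file names and the 5x5 kernel size are assumptions.

import cv2

img = cv2.imread("ct_slice.png", cv2.IMREAD_GRAYSCALE)   # placeholder scan image

# Replace each pixel with the median of its 5x5 neighborhood; pixels that differ
# strongly from their neighbors (impulse / salt-and-pepper noise) are suppressed.
denoised = cv2.medianBlur(img, 5)

cv2.imwrite("ct_slice_denoised.png", denoised)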
Convolutional Neural Networks:

A CNN is a type of DNN consisting of multiple hidden layers such as the convolutional layer, ReLU layer, pooling layer, and fully connected and normalized layers. A CNN shares weights in the convolutional layer, reducing the memory footprint and increasing the performance of the network. The important features of a CNN lie in the 3D volumes of neurons, local connectivity and shared weights. A feature map is produced by the convolution layer through convolution of different sub-regions of the input image with a learned kernel. Then, a non-linear activation function is applied through the ReLU layer to improve the convergence properties when the error is low. In the pooling layer, a region of the image/feature map is chosen and the pixel with the maximum value among them, or the average value, is chosen as the representative pixel, so that a 2x2 or 3x3 grid is reduced to a single scalar value. This results in a large reduction in the sample size. Sometimes, a traditional Fully-Connected (FC) layer is used in conjunction with the convolutional layers towards the output stage.

A CNN is composed of several kinds of layers:

Convolutional layer: creates a feature map to predict the class probabilities for each
feature by applying a filter that scans the whole image, few pixels at a time.

Pooling layer (down-sampling): scales down the amount of information the


convolutional layer generated for each feature and maintains the most essential
information (the process of the convolutional and pooling layers usually repeats several
times).
Fully connected input layer: flattens the outputs generated by previous layers to turn them
into a single vector that can be used as an input for the next layer.

Fully connected layer: Applies weights over the input generated by the feature analysis to
predict an accurate label.

Fig 4.2 : Convolutional Neural Network General Architecture
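To make the layer roles above concrete, here is a minimal Keras sketch of such a stack. The layer sizes, input shape and class count are illustrative assumptions, not the architecture used in this project.

from tensorflow import keras
from tensorflow.keras import layers

num_classes = 4                                          # hypothetical number of classes

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),                     # small RGB input, for illustration
    layers.Conv2D(32, (3, 3), activation="relu"),        # convolutional layer -> feature maps
    layers.MaxPooling2D((2, 2)),                         # pooling layer -> down-sampling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                    # fully connected input layer (flatten)
    layers.Dense(128, activation="relu"),                # fully connected layer
    layers.Dense(num_classes, activation="softmax"),     # class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()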


Training:

The back-propagation algorithm is used to train the Deep CNN to detect lung tumors in CT images of size 5×20×20. It consists of two phases. In the first phase, a CNN consisting of multiple volumetric convolution, rectified linear unit (ReLU) and max pooling layers is used to extract valuable volumetric features from the input data. The second phase is the classifier. It has multiple FC and threshold layers, followed by a SoftMax layer, to perform the high-level reasoning of the neural network. No scaling was applied to the CT images of the dataset, to preserve the original values of the DICOM images as much as possible. During training, random sub-volumes are extracted from the CT images of the training set and are normalized according to an estimate of the normal distribution of the voxel values in the dataset.

Testing and Results:

The neural network based on convolution and watershed segmentation has been implemented in Python, and the system is trained with sample data sets for the model to understand and become familiar with lung cancer. A sample image is fed as input to the trained model, and the model at this stage is able to tell whether cancer is present and to locate the cancer spot in the sample image. The process involves feeding the input image, preprocessing, feature extraction, identifying the cancer spot and indicating the results to the user. In case a malignancy is present, a message indicating its presence will be displayed on the screen.
SOFTWARE:

OpenCV

OpenCV is a library of programming functions mainly aimed at real-time computer vision. It has a modular structure, which means that the package includes several shared or static libraries. We use the image processing module, which includes linear and non-linear image filtering, geometrical image transformations (resize, affine and perspective warping, and generic table-based remapping), color space conversion, histograms, and so on. Our project uses techniques such as the Viola-Jones (Haar) classifier, the LBPH (Local Binary Patterns Histograms) face recognizer, and the Histogram of Oriented Gradients (HOG).
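As an illustration of the Viola-Jones/Haar face detector mentioned above (the input file name is a placeholder):

import cv2

# Load the Haar cascade for frontal faces that ships with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("person.jpg")                        # placeholder image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect faces; scaleFactor and minNeighbors are the usual tuning knobs.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

cv2.imshow("Detected faces", img)
cv2.waitKey(0)
cv2.destroyAllWindows()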

OpenCV-Python

Python is a general purpose programming language started by Guido van Rossum, which became very popular in
short time mainly because of its simplicity and code readability. It enables the programmer to express his ideas in
fewer lines of code without reducing any readability.

Compared to other languages like C/C++, Python is slower. But another important feature of Python is that it can be
easily extended with C/C++. This feature helps us to write computationally intensive codes in C/C++ and create a
Python wrapper for it so that we can use these wrappers as Python modules. This gives us two advantages: first, our
code is as fast as original C/C++ code (since it is the actual C++ code working in background) and second, it is very
easy to code in Python. This is how OpenCV-Python works, it is a Python wrapper around original C++
implementation. And the support of Numpy makes the task more easier. Numpy is a highly optimized library for
numerical operations. It gives a MATLAB-style syntax. All the OpenCV array structures are converted to-and-from
Numpy arrays. So whatever operations you can do in Numpy, you can combine it with OpenCV, which increases
number of weapons in your arsenal. Besides that, several other libraries like SciPy, Matplotlib which supports
Numpy can be used with this. So OpenCV-Python is an appropriate tool for fast prototyping of computer vision
problems.

OpenCV-Python working

OpenCV introduces a new set of tutorials which will guide you through various functions available in OpenCV-
Python. This guide is mainly focused on OpenCV 3.x version (although most of the tutorials will work with
OpenCV 2.x also).

Prior knowledge of Python and Numpy is required before starting because they won't be covered in this guide. In particular, a good knowledge of Numpy is a must to write optimized code in OpenCV-Python.
This tutorial has been started by Abid Rahman K. as part of Google Summer of Code 2013 program, under the
guidance of Alexander Mordvintsev.

OpenCV Needs us..


Since OpenCV is an open source initiative, all are welcome to make contributions to this library, and the same goes for this tutorial. So, if you find any mistake in this tutorial (whether it be a small spelling mistake or a big error in code or concepts, whatever), feel free to correct it. That is a good task for newcomers who are beginning to contribute to open source projects. Just fork OpenCV on GitHub, make the necessary corrections and send a pull request to OpenCV. OpenCV developers will check your pull request, give you important feedback and, once it passes the approval of the reviewer, it will be merged into OpenCV. Then you become an open source contributor. The same applies to other tutorials, documentation, etc.

As new modules are added to OpenCV-Python, this tutorial will have to be expanded. Those who know a particular algorithm can write up a tutorial that includes a basic theory of the algorithm and code showing its basic usage, and submit it to OpenCV.

Getting Started with Images


Goals

• Here, you will learn how to read an image, how to display it and how to save it back
• You will learn these functions: cv2.imread(), cv2.imshow(), cv2.imwrite()
• Optionally, you will learn how to display images with Matplotlib

Using OpenCV

Read an image
Use the function cv2.imread() to read an image. The image should be in the working directory or a full path of
image should be given.

The second argument is a flag which specifies the way the image should be read.

• cv2.IMREAD_COLOR : Loads a color image. Any transparency of the image will be neglected. It is the default flag.
• cv2.IMREAD_GRAYSCALE : Loads the image in grayscale mode.
• cv2.IMREAD_UNCHANGED : Loads the image as such, including the alpha channel.
Display an image
Use the function cv2.imshow() to display an image in a window. The window automatically fits to the image size.

The first argument is a window name, which is a string; the second argument is our image. You can create as many windows as you wish, but with different window names.

cv2.waitKey() is a keyboard binding function. Its argument is the time in milliseconds. The function waits for the specified milliseconds for any keyboard event. If you press any key in that time, the program continues. If 0 is passed, it waits indefinitely for a key stroke. It can also be set to detect specific key strokes, e.g. whether key a is pressed, which we will discuss below.

cv2.destroyAllWindows() simply destroys all the windows we created. If you want to destroy any specific window, use the function cv2.destroyWindow(), where you pass the exact window name as the argument.
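Putting the three functions together (the file names are placeholders): read an image, show it until a key is pressed, then save a copy.

import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)   # load in grayscale mode

if img is None:
    print("Could not read input.jpg")
else:
    cv2.imshow("image", img)        # window named "image", sized to the picture
    cv2.waitKey(0)                  # wait indefinitely for any key press
    cv2.destroyAllWindows()
    cv2.imwrite("output.png", img)  # save the image back to disk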
Python Open CV

For the project, Python with OpenCV was chosen as the programming environment. Python is a high-level language that is well suited to data analysis and mathematical computation. OpenCV's documentation and downloads are available from its official website, opencv.org.
The environment has an interactive interpreter that allows users to test and experiment with code line by line. Users can also save their code into a script file and run the program. The online documentation is also very useful: it properly categorizes and provides detailed explanations and sample usages of all functions. Just like C++ and Java, the language syntax provides loops and conditional statements for programming purposes.
The language was chosen over C++ and Java because there are a lot of built-in functions that are specific to image processing, and the underlying libraries can compute large mathematical operations quickly. These advantages suit the project perfectly due to the large matrix computations required during the extraction process.

Fig: Python OpenCV command window

There were some minor problems that occurred while working on the project. The first problem was that Python and OpenCV were a completely new language and environment for me; I had to familiarize myself with them by practicing simple tutorials and exploring the programming environment. Another problem that arose is that running the segmentation code takes a long time: although the matrix operations themselves are fast, large video files take a long time for a scripting language to process. Lastly, the environment requires a lot of memory to run; during start-up and processing, Windows often cannot provide enough memory and will sometimes shut the program down automatically.
Figure: Problem that arose in the Python OpenCV window
YOLO for Object Classification:

YOLO (You Only Look Once), is a network for object detection. The object detection task consists in determining
the location on the image where certain objects are present, as well as classifying those objects. Previous methods
for this, like R-CNN and its variations, used a pipeline to perform this task in multiple steps. This can be slow to run
and also hard to optimize, because each individual component must be trained separately. YOLO does it all with a single neural network. From the paper:

We reframe the object detection as a single regression problem, straight from image pixels to bounding box
coordinates and class probabilities.

So, to put it simply, you take an image as input, pass it through a neural network that looks similar to a normal
CNN, and you get a vector of bounding boxes and class predictions in the output.

So, what do these predictions look like?


The Predictions Vector

The first step to understanding YOLO is how it encodes its output. The input image is divided into an S x S grid of cells. For each object that is present in the image, one grid cell is said to be "responsible" for predicting it: the cell into which the center of the object falls.
Each grid cell predicts B bounding boxes as well as C class probabilities. The bounding box prediction has 5 components: (x, y, w, h, confidence). The (x, y) coordinates represent the center of the box, relative to the grid cell location (remember that if the center of the box does not fall inside the grid cell, then this cell is not responsible for it). These coordinates are normalized to fall between 0 and 1. The (w, h) box dimensions are also normalized to [0, 1], relative to the image size. Let's look at an example:
Example of how to calculate box coordinates in a 448x448 image with S=3. Note how the (x, y) coordinates are calculated relative to the center grid cell.
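A tiny numeric sketch of the normalization just described, using the 448x448 image and S = 3 from the example; the raw box values are made up for illustration.

# Image of 448 x 448 pixels divided into an S x S grid (S = 3), so each cell is ~149.3 px wide.
S, img_w, img_h = 3, 448.0, 448.0
cell_w, cell_h = img_w / S, img_h / S

# Hypothetical ground-truth box: center at (250, 300), size 100 x 180 pixels.
cx, cy, w, h = 250.0, 300.0, 100.0, 180.0

# The responsible cell is the one containing the box center.
col, row = int(cx // cell_w), int(cy // cell_h)

# (x, y) are the center coordinates relative to that cell, normalized to [0, 1].
x = (cx - col * cell_w) / cell_w
y = (cy - row * cell_h) / cell_h
# (w, h) are normalized by the full image size.
w_norm, h_norm = w / img_w, h / img_h

print(f"cell (row={row}, col={col}), x={x:.3f}, y={y:.3f}, w={w_norm:.3f}, h={h_norm:.3f}")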

There is still one more component in the bounding box prediction, which is the confidence score. From the paper:

Formally we define confidence as Pr(Object) * IOU(pred, truth) . If no object exists in that cell, the confidence
score should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between
the predicted box and the ground truth.
Note that the confidence reflects the presence or absence of an object of any class; IOU (intersection over union)
measures how much the predicted box overlaps the ground-truth box.
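
A minimal IOU helper, written here in plain Python for boxes given as (x1, y1, x2, y2) corners, could look like this:

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143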
Now that we understand the 5 components of the box prediction, remember that each grid cell makes B of those
predictions, so there are in total S x S x B * 5 outputs related to bounding box predictions.
It is also necessary to predict the class probabilities, Pr(Class(i) | Object). This probability is conditioned on the grid
cell containing an object. In practice, it means that if no object is present in the grid cell, the loss function will not
penalize it for a wrong class prediction, as we will see later. The network only predicts one set of class probabilities
per cell, regardless of the number of boxes B. That makes S x S x C class probabilities in total.
Adding the class predictions to the output vector, we get an S x S x (B * 5 + C) tensor as output.
Each grid cell makes B bounding box predictions and C class predictions (S=3, B=2 and C=3 in this example)
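
The following NumPy sketch shows how such an S x S x (B*5 + C) tensor could be decoded; the random tensor stands in for a real network output, and the score combination follows the description above:

import numpy as np

S, B, C = 3, 2, 3
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for a real network output

boxes = []
for i in range(S):          # grid row
    for j in range(S):      # grid column
        cell = pred[i, j]
        class_probs = cell[B * 5:]               # Pr(Class_i | Object), length C
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]
            # Class-specific confidence = Pr(Class_i | Object) * Pr(Object) * IOU
            scores = class_probs * conf
            boxes.append(((i, j), (x, y, w, h), scores))

print(len(boxes))            # S * S * B = 18 candidate boxes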
The Network

Once you understand how the predictions are encoded, the rest is easy. The network structure looks like a normal
CNN, with convolutional and max pooling layers, followed by 2 fully connected layers in the end:

┌────────────┬────────────────────────┬───────────────────┐
│ Name │ Filters │ Output Dimension │
├────────────┼────────────────────────┼───────────────────┤
│ Conv 1 │ 7 x 7 x 64, stride=2 │ 224 x 224 x 64 │
│ Max Pool 1 │ 2 x 2, stride=2 │ 112 x 112 x 64 │
│ Conv 2 │ 3 x 3 x 192 │ 112 x 112 x 192 │
│ Max Pool 2 │ 2 x 2, stride=2 │ 56 x 56 x 192 │
│ Conv 3 │ 1 x 1 x 128 │ 56 x 56 x 128 │
│ Conv 4 │ 3 x 3 x 256 │ 56 x 56 x 256 │
│ Conv 5 │ 1 x 1 x 256 │ 56 x 56 x 256 │
│ Conv 6 │ 1 x 1 x 512 │ 56 x 56 x 512 │
│ Max Pool 3 │ 2 x 2, stride=2 │ 28 x 28 x 512 │
│ Conv 7 │ 1 x 1 x 256 │ 28 x 28 x 256 │
│ Conv 8 │ 3 x 3 x 512 │ 28 x 28 x 512 │
│ Conv 9 │ 1 x 1 x 256 │ 28 x 28 x 256 │
│ Conv 10 │ 3 x 3 x 512 │ 28 x 28 x 512 │
│ Conv 11 │ 1 x 1 x 256 │ 28 x 28 x 256 │
│ Conv 12 │ 3 x 3 x 512 │ 28 x 28 x 512 │
│ Conv 13 │ 1 x 1 x 256 │ 28 x 28 x 256 │
│ Conv 14 │ 3 x 3 x 512 │ 28 x 28 x 512 │
│ Conv 15 │ 1 x 1 x 512 │ 28 x 28 x 512 │
│ Conv 16 │ 3 x 3 x 1024 │ 28 x 28 x 1024 │
│ Max Pool 4 │ 2 x 2, stride=2 │ 14 x 14 x 1024 │
│ Conv 17 │ 1 x 1 x 512 │ 14 x 14 x 512 │
│ Conv 18 │ 3 x 3 x 1024 │ 14 x 14 x 1024 │
│ Conv 19 │ 1 x 1 x 512 │ 14 x 14 x 512 │
│ Conv 20 │ 3 x 3 x 1024 │ 14 x 14 x 1024 │
│ Conv 21 │ 3 x 3 x 1024 │ 14 x 14 x 1024 │
│ Conv 22 │ 3 x 3 x 1024, stride=2 │ 7 x 7 x 1024 │
│ Conv 23 │ 3 x 3 x 1024 │ 7 x 7 x 1024 │
│ Conv 24 │ 3 x 3 x 1024 │ 7 x 7 x 1024 │
│ FC 1 │ - │ 4096 │
│ FC 2 │ - │ 7 x 7 x 30 (1470) │
└────────────┴────────────────────────┴───────────────────┘

Some comments about the architecture:

 Note that the architecture was crafted for use in the Pascal VOC dataset, where the authors used S=7, B=2 and
C=20. This explains why the final feature maps are 7x7, and also explains the size of the output (7x7x(2*5+20)).
Use of this network with a different grid size or different number of classes might require tuning of the layer
dimensions.

 The authors mention that there is a fast version of YOLO, with fewer convolutional layers. The table above,
however, displays the full version.

 The sequences of 1x1 reduction layers and 3x3 convolutional layers were inspired by the GoogLeNet (Inception)
model

 The final layer uses a linear activation function. All other layers use a leaky ReLU (Φ(x) = x if x > 0; 0.1x
otherwise); a small sketch of this activation follows the list.
 If you are not familiar with convolutional networks, it is worth reading an introduction to them before continuing.
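
A minimal NumPy sketch of the leaky ReLU described above:

import numpy as np

# Leaky ReLU as used in all but the last layer: x for x > 0, 0.1*x otherwise.
def leaky_relu(x):
    return np.where(x > 0, x, 0.1 * x)

print(leaky_relu(np.array([-2.0, 0.5])))   # [-0.2  0.5]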
The Loss Function

There is a lot to say about the loss function, so let's do it by parts. It starts like this:

YOLO Loss Function — Part 1:
λ_coord · Σ_{i=0}^{S²} Σ_{j=0}^{B} 𝟙_ij^obj [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]


This equation computes the loss related to the predicted bounding box position (x,y). Don’t worry about λ for now,
just consider it a given constant. The function computes a sum over each bounding box predictor (j = 0.. B) of each
grid cell (i = 0 .. S^2). 𝟙 obj is defined as follows:
 1, If an object is present in grid cell i and the jth bounding box predictor is “responsible” for that prediction
 0, otherwise

But how do we know which predictor is responsible for the object? Quoting the original paper:

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to
be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on
which prediction has the highest current IOU with the ground truth.
The other terms in the equation should be easy to understand: (x, y) are the predicted bounding box position and (x̂,
ŷ) are the actual position from the training data.
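
In code, this assignment could be as simple as picking the predictor with the highest IOU against the ground-truth box; the sketch below reuses the iou() helper sketched earlier, and pred_boxes / gt_box are assumed corner-format boxes:

# Pick the "responsible" predictor for one ground-truth box: the predictor
# (out of the B boxes of the matching cell) with the highest IOU against it.
def responsible_predictor(pred_boxes, gt_box):
    return max(range(len(pred_boxes)), key=lambda j: iou(pred_boxes[j], gt_box))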

Let’s move on to the second part:

YOLO Loss Function — Part 2:
λ_coord · Σ_{i=0}^{S²} Σ_{j=0}^{B} 𝟙_ij^obj [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]

This is the loss related to the predicted box width / height. The equation looks similar to the first one, except for the
square root. What’s up with that? Quoting the paper again:

Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially
address this we predict the square root of the bounding box width and height instead of the width and height
directly.

Moving on to the third part:

YOLO Loss Function — Part 3:
Σ_{i=0}^{S²} Σ_{j=0}^{B} 𝟙_ij^obj (C_i − Ĉ_i)² + λ_noobj · Σ_{i=0}^{S²} Σ_{j=0}^{B} 𝟙_ij^noobj (C_i − Ĉ_i)²


Here we compute the loss associated with the confidence score for each bounding box predictor. C is the confidence
score and Ĉ is the intersection over union of the predicted bounding box with the ground truth. 𝟙 obj is equal to one
when there is an object in the cell, and 0 otherwise. 𝟙 noobj is the opposite.
The λ parameters that appear here and also in the first part are used to weight parts of the loss function differently.
This is necessary to increase model stability. The highest penalty is for coordinate predictions (λ coord = 5) and the
lowest for confidence predictions when no object is present (λ noobj = 0.5).

The last part of the loss function is the classification loss:

YOLO Loss Function — Part 4:
Σ_{i=0}^{S²} 𝟙_i^obj Σ_{c ∈ classes} (p_i(c) − p̂_i(c))²


It looks similar to a normal sum-squared error for classification, except for the 𝟙 obj term. This term is used so that
we don’t penalize the classification error when no object is present in the cell (hence the conditional class
probability discussed earlier).
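
Putting the four parts together, a compact NumPy sketch of the loss might look like the following. This is an illustrative sketch only (not the authors' implementation): it assumes the matching step has already been done, so the obj/noobj masks and the hat (target) arrays are given.

import numpy as np

# Box arrays have shape (S, S, B), class arrays (S, S, C), cell_obj_mask (S, S).
def yolo_loss(x, y, w, h, conf, cls,
              x_hat, y_hat, w_hat, h_hat, iou_hat, cls_hat,
              obj_mask, cell_obj_mask,
              lambda_coord=5.0, lambda_noobj=0.5):
    noobj_mask = 1.0 - obj_mask

    # Part 1: centre coordinates, only for responsible predictors.
    coord = lambda_coord * np.sum(obj_mask * ((x - x_hat) ** 2 + (y - y_hat) ** 2))
    # Part 2: square roots of width and height.
    size = lambda_coord * np.sum(obj_mask * ((np.sqrt(w) - np.sqrt(w_hat)) ** 2 +
                                             (np.sqrt(h) - np.sqrt(h_hat)) ** 2))
    # Part 3: confidence, weighted down where no object is present.
    conf_obj = np.sum(obj_mask * (conf - iou_hat) ** 2)
    conf_noobj = lambda_noobj * np.sum(noobj_mask * (conf - iou_hat) ** 2)
    # Part 4: classification, only for cells that contain an object.
    classify = np.sum(cell_obj_mask[..., None] * (cls - cls_hat) ** 2)

    return coord + size + conf_obj + conf_noobj + classify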
The Training

The authors describe the training in the following way:

 First, pretrain the first 20 convolutional layers using the ImageNet 1000-class competition dataset, using an input
size of 224x224

 Then, increase the input resolution to 448x448

 Train the full network for about 135 epochs using a batch size of 64, momentum of 0.9 and decay of 0.0005

 Learning rate schedule: for the first epochs, the learning rate was slowly raised from 0.001 to 0.01. Train for about
75 epochs and then start decreasing it.

 Use data augmentation with random scaling and translations, and randomly adjust exposure and saturation (a rough OpenCV sketch of these augmentations follows this list).
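
The sketch below uses only OpenCV and NumPy; the ranges are assumptions rather than the paper's exact values, and in a real pipeline the bounding-box labels would have to be transformed with the same affine matrix:

import cv2
import numpy as np

def augment(img):
    h, w = img.shape[:2]

    # Random scaling and translation (up to ~20% of the image size).
    scale = np.random.uniform(0.8, 1.2)
    tx = np.random.uniform(-0.2, 0.2) * w
    ty = np.random.uniform(-0.2, 0.2) * h
    M = np.float32([[scale, 0, tx], [0, scale, ty]])
    img = cv2.warpAffine(img, M, (w, h))

    # Random exposure and saturation adjustment in HSV space.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= np.random.uniform(0.7, 1.3)   # saturation
    hsv[..., 2] *= np.random.uniform(0.7, 1.3)   # exposure (value)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)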

Haar Cascade
Theory

Object Detection using Haar feature-based cascade classifiers is an effective object detection method proposed by
Paul Viola and Michael Jones in their paper, "Rapid Object Detection using a Boosted Cascade of Simple Features"
in 2001. It is a machine learning based approach where a cascade function is trained from a lot of positive and
negative images. It is then used to detect objects in other images.

Here we will work with face detection. Initially, the algorithm needs a lot of positive images (images of faces) and
negative images (images without faces) to train the classifier. Then we need to extract features from it. For this,
Haar features shown in the below image are used. They are just like our convolutional kernel. Each feature is a
single value obtained by subtracting sum of pixels under the white rectangle from sum of pixels under the black
rectangle.

Fig: Haar features

Now, all possible sizes and locations of each kernel are used to calculate lots of features. (Just imagine how much
computation that needs: even a 24x24 window results in over 160000 features.) For each feature calculation, we need
to find the sum of the pixels under the white and black rectangles. To solve this, they introduced the integral image.
However large your image, it reduces the calculation of any rectangle sum to an operation involving just four values.
This makes things very fast.
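
For illustration, OpenCV's cv2.integral() computes exactly this. The sketch below uses an assumed sample image and assumed rectangle positions to show the four-lookup rectangle sum and a simplified two-rectangle feature:

import cv2

img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)  # assumed sample image
ii = cv2.integral(img)       # shape (h+1, w+1); ii[y, x] = sum of img[:y, :x]

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w x h rectangle with top-left corner (x, y)."""
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])

# A (very simplified) two-rectangle Haar feature: as described above, the
# sum under the white rectangle is subtracted from the sum under the black one.
white = rect_sum(ii, 10, 10, 12, 6)
black = rect_sum(ii, 10, 16, 12, 6)
feature_value = black - white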

But among all these features we calculated, most of them are irrelevant. For example, consider the image below.
The top row shows two good features. The first feature selected seems to focus on the property that the region of the
eyes is often darker than the region of the nose and cheeks. The second feature selected relies on the property that
the eyes are darker than the bridge of the nose. But the same windows applied to cheeks or any other place is
irrelevant. So how do we select the best features out of 160000+ features? It is achieved by Adaboost.
Fig: the two best Haar features, applied over the eye and nose/cheek regions as described above

For this, we apply each and every feature on all the training images. For each feature, it finds the
best threshold which will classify the faces to positive and negative. Obviously, there will be errors or
misclassifications. We select the features with minimum error rate, which means they are the features that most
accurately classify the face and non-face images. (The process is not as simple as this. Each image is given an equal
weight in the beginning. After each classification, weights of misclassified images are increased. Then the same
process is done. New error rates are calculated. Also new weights. The process is continued until the required
accuracy or error rate is achieved or the required number of features are found).

The final classifier is a weighted sum of these weak classifiers. It is called weak because it alone can't classify the
image, but together with others forms a strong classifier. The paper says even 200 features provide detection with
95% accuracy. Their final setup had around 6000 features. (Imagine a reduction from 160000+ features to 6000
features. That is a big gain).

So now you take an image, take each 24x24 window, apply 6000 features to it, and check whether it is a face or not.
Isn't that a little inefficient and time-consuming? Yes, it is, and the authors have a good solution for that.

In an image, most of the image is non-face region. So it is a better idea to have a simple method to check if a
window is not a face region. If it is not, discard it in a single shot, and don't process it again. Instead, focus on
regions where there can be a face. This way, we spend more time checking possible face regions.

For this they introduced the concept of a Cascade of Classifiers. Instead of applying all 6000 features to a window,
the features are grouped into different stages of classifiers and applied one by one. (Normally the first few stages
contain very few features.) If a window fails the first stage, discard it; we don't consider the remaining
features on it. If it passes, apply the second stage of features and continue the process. A window which passes all
stages is a face region. That is the plan.

The authors' detector had 6000+ features in 38 stages, with 1, 10, 25, 25 and 50 features in the first five stages.
(The two features in the image above are actually the best two features selected by Adaboost.) According to the
authors, on average 10 features out of the 6000+ are evaluated per sub-window.
This is a simple, intuitive explanation of how Viola-Jones face detection works; read the original paper for more
details.
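
For reference, the sketch below shows a typical way to use OpenCV's pre-trained frontal-face Haar cascade; the sample image name is an assumption, and the scaleFactor/minNeighbors values are common starting points rather than tuned parameters:

import cv2

# The XML file ships with the opencv-python package; cv2.data.haarcascades
# gives the folder it lives in.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("person.jpg")              # assumed sample image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scaleFactor controls how much the search window grows between scales;
# minNeighbors filters out detections with too few overlapping hits.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

cv2.imwrite("faces_detected.jpg", img)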
System Testing
Testing is the process of evaluating a system or its component(s) with the intent of finding out whether it satisfies the
specified requirements or not. Testing means executing a system in order to identify any gaps, errors, or missing
requirements contrary to the actual requirements.

Testing Principle

Before applying methods to design effective test cases, a software engineer must understand the basic principles
that guide software testing. All tests should be traceable to customer requirements.

Testing Methods

There are different methods that can be used for software testing. They are,

1. Black-Box Testing

The technique of testing without having any knowledge of the interior workings of the application is
called black-box testing. The tester is oblivious to the system architecture and does not have access to
the source code. Typically, while performing a black-box test, a tester will interact with the system's user
interface by providing inputs and examining outputs without knowing how and where the inputs are
worked upon.

2. White-Box Testing

White-box testing is the detailed investigation of internal logic and structure of the code. White-box
testing is also called glass testing or open-box testing. In order to perform white-box testing on an
application, a tester needs to know the internal workings of the code. The tester needs to have a look
inside the source code and find out which unit/chunk of the code is behaving inappropriately.
Levels of Testing

There are different levels during the process of testing. Levels of testing include different methodologies that
can be used while conducting software testing. The main levels of software testing are:

 Functional Testing:

This is a type of black-box testing that is based on the specifications of the software that is to be tested.
The application is tested by providing input and then the results are examined that need to conform to
the functionality it was intended for. Functional testing of software is conducted on a complete,
integrated system to evaluate the system's compliance with its specified requirements. There are five
steps that are involved while testing an application for functionality.

 The determination of the functionality that the intended application is meant to perform.

 The creation of test data based on the specifications of the application.

 The output based on the test data and the specifications of the application.

 The writing of test scenarios and the execution of test cases.

 The comparison of actual and expected results based on the executed test cases.

 Non-functional Testing

This section covers testing an application based on its non-functional attributes. Non-functional
testing involves testing software against requirements which are non-functional in nature but important,
such as performance, security, user interface, etc. Testing can be done at different levels of the SDLC; a few
of them are:
Unit Testing

Unit testing is a software development process in which the smallest testable parts of an application, called
units, are individually and independently scrutinized for proper operation. Unit testing is often automated but it
can also be done manually. The goal of unit testing is to isolate each part of the program and show that
individual parts are correct in terms of requirements and functionality. Test cases and results are shown in the
Tables.
Unit Testing Benefits

 Unit testing increases confidence in changing/maintaining code.

 Code is more reusable.
 Development is faster.
 The cost of fixing a defect detected during unit testing is lower compared to that
of defects detected at higher levels.
 Debugging is easy.
 Code is more reliable.

Unit testing:

Sl # Test Case: UTC-1
Name of Test: Image or video capture
Items being tested: Input image
Sample Input: Camera stream
Expected output: Should capture the input image
Actual output: Image captured successfully
Remarks: Pass

Sl # Test Case: UTC-2
Name of Test: Object detection
Items being tested: Labelling
Sample Input: Image or video
Expected output: Objects such as humans will be detected
Actual output: Objects detected
Remarks: Test passed

Integration Testing:

Integration testing is a level of software testing where individual units are combined and
tested as a group. The purpose of this level of testing is to expose faults in the interaction
between integrated units. Test drivers and test stubs are used to assist in Integration Testing.
Integration testing is defined as the testing of combined parts of an application to determine
if they function correctly. It occurs after unit testing and before validation testing. Integration
testing can be done in two ways: Bottom-up integration testing and Top-down integration
testing.

1. Bottom-up Integration

This testing begins with unit testing, followed by tests of progressively higher-level
combinations of units called modules or builds.

2. Top-down Integration

In this testing, the highest-level modules are tested first and progressively lower-level modules are tested
thereafter.

In a comprehensive software development environment, bottom-up testing is usually done first, followed by
top-down testing. The process concludes with multiple tests of the complete application, preferably in
scenarios designed to mimic actual situations. The tables below show the test cases for integration testing
and their results.

Sl # Test Case: ITC-1
Name of Test: Image capture and automated object detection
Item being tested: Capture image and automated object classification
Sample Input: Video stream
Expected output: Detect human objects
Actual output: Functioned properly
Remarks: Pass

Sl # Test Case: ITC-2
Name of Test: Action recognition
Item being tested: Human action such as walking
Sample Input: Video stream
Expected output: Action recognized
Actual output: Functioned properly
Remarks: Pass

System testing:

System testing of software or hardware is testing conducted on a complete, integrated system to evaluate the
system's compliance with its specified requirements. System testing falls within the scope of black-box testing
and, as such, should require no knowledge of the inner design of the code or logic. System testing is important
for the following reasons:

 System testing is the first level of testing where the application is tested as a whole.

 The application is tested thoroughly to verify that it meets the functional and technical
specifications.

 The application is tested in an environment that is very close to the production environment where the
application will be deployed.

 System testing enables us to test, verify, and validate both the business requirements
as well as the application architecture.

System testing is shown in the table below:

Sl # Test Case: STC-1
Name of Test: System testing on various versions of the OS
Item being tested: OS compatibility
Sample Input: Execute the program on Windows XP / Windows 7 / Windows 8
Expected output: Performance is better on Windows 7
Actual output: Same as expected output; performance is better on Windows 7
Remarks: Pass

Chapter 6:
Conclusion:
In this Human Activity Recognition system, we proposed a model trained using a convolutional neural network
(CNN) with spatiotemporal three-dimensional kernels on the Kinetics dataset to recognize almost 400 human
activities with a satisfactory accuracy level. The designed system can be used to automatically categorize a dataset
of videos on disk, train and monitor a new employee to correctly perform a task, verify food-service work, and
monitor bar/restaurant patrons to ensure they are well served. We used a dataset covering more than 100 activities
in the CNN model to make the system more versatile. It was also observed that increasing the number of samples
for an activity in the dataset improves the performance.

FUTURE SCOPE

Activity recognition is the basis for the development of many potential applications in health,
wellness, and sports:
1. HAR can be used for health monitoring, achieved by analyzing a person's activity from the information
collected by different devices.
2. HAR is used to discover similar patterns, which are the variables that determine which activity
the human performs.
3. HAR can be used for robotic automation, making it easier to train a robot to interact with humans and
objects.

Performance Analysis:

CNN: [Next comes the Convolutional Neural Network (CNN, or ConvNet), which is a class of deep
neural networks most commonly applied to analyzing visual imagery. Their other applications
include video understanding, speech recognition, and natural language processing.[8]]
Advantages: [The use of CNNs is motivated by the fact that they can capture / learn
relevant features from an image or video at different levels, similar to a human brain. This is feature
learning! Conventional neural networks cannot do this. Another main feature of CNNs is weight sharing.
Let's take an example to explain this. Say you have a one-layered CNN with 10 filters of size 5x5. You
can simply calculate the parameters of such a CNN: it would be 5*5*10 weights and 10 biases, i.e.,
5*5*10 + 10 = 260 parameters. Now let's take a simple one-layered NN with 250 neurons; here the number
of weight parameters depends on the size of the image and is '250 x K', where the size of the image is P x M and
K = (P * M). Additionally, you need 250 biases. For MNIST data (28x28 = 784 pixels) as input, such a NN
has 250*784 + 250 = 196,250 parameters.
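
The comparison can be checked with a few lines of plain Python (assuming a single input channel for the CNN layer and a 28x28 MNIST-sized input for the fully connected layer):

# Quick check of the parameter counts discussed above.
# CNN layer: 10 filters of size 5x5 (single input channel assumed).
cnn_params = 5 * 5 * 10 + 10              # weights + biases = 260

# Fully connected layer: 250 neurons on a 28x28 input.
P, M, neurons = 28, 28, 250
nn_params = neurons * (P * M) + neurons   # 196,250

print(cnn_params, nn_params)              # 260 196250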

Clearly, a CNN is more efficient in terms of memory and complexity. Imagine NNs and CNNs with billions
of neurons: the CNN would be less complex and would save memory compared to the NN. In terms of
performance, CNNs outperform NNs on conventional image recognition tasks and many other tasks;
look at the Inception model, ResNet50 and many others, for instance. For a completely new task or
problem, CNNs are very good feature extractors. This means that you can extract useful attributes from
an already trained CNN with its trained weights by feeding your data through each level and tuning the CNN a
bit for the specific task, e.g., adding a classifier after the last layer with labels specific to the task. This is
also called pre-training, and CNNs are very efficient in such tasks compared to NNs. Another advantage
of this pre-training is that we avoid training the CNN from scratch and save memory and time. The only thing
you have to train is the classifier at the end for your labels.[14]]

Disadvantages:

[A convolutional neural network is significantly slower due to operations such as max pooling.

 If the CNN has several layers, then the training process takes a lot of time if the computer doesn't
have a good GPU.
 A ConvNet requires a large dataset to process and train the neural network.[15]]
 [Also, LSTM combined with Convolutional Neural Networks (CNNs) has improved automatic image
captioning, like that seen on Facebook. Thus, you can see that RNNs are more about helping us in data
processing and predicting the next step, whereas CNNs help us in visual analysis. RNNs operate over
sequences of vectors (sequences in the input, the output, or in the most general case both), in comparison
with CNNs, which not only have a constrained Application Programming Interface (API) but also a fixed
number of computational steps. This is why CNNs are, in a sense, more powerful now than RNNs.]

Chapter 7:
BIBLIOGRAPHY
[1] Venkata Ramana, Lakshmi Prasanna, "Human activity recognition using OpenCV",
International Journal of Creative Research Thoughts (IJCRT), 2021.
[2] Lamiyah Khattar, Garima Aggarwal, "Analysis of human activity recognition using deep
learning", 2021 11th International Conference on Cloud Computing, Data Science &
Engineering, 2021.
[3] Long Cheng, Yani Guan, "Recognition of human activities using machine learning methods
with wearable sensors", IEEE Members Research and Development Department, 2017.
[4] Ran He, Zhenan Sun, "Adversarial cross-spectral face completion for NIR-VIS face
recognition", IEEE paper, received January 2019.
[5] Jing Wang, Yu Cheng, "Walk and learn: facial attribute representation learning", 2016 IEEE
Conference on Computer Vision and Pattern Recognition.
[6] Yongjing Lin, Huosheng Xie, "Face gender recognition based on face recognition feature
vectors", International Conference on Information Systems and Computer Aided Education
(ICISCAE), 2020.
[7] Mohanad Babiker, Muhamed Zaharadeen, "Automated daily human activity
recognition for video surveillance using neural network", International Conference
on Smart Instrumentation, Measurement and Applications (ICSIMA), 28-30 November
2017.
[8] Neha Sana Ghosh, Anupam Ghosh, "Detection of human activity by widget", 2020
8th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO),
June 4-5, 2020.
[9] An Yang, Wang Kan, "Segmentation and recognition of basic and
transitional activities for continuous physical human activity", IEEE paper, 2016.
[10] Abdullah Al Fahim and Ki H. Chon, "Smartphone based human activity
recognition with feature selection and dense neural network", International
Conference on Reliability, Infocom Technologies and Optimization (ICRITO), 2020.

APPENDIX A
ACRONYMS
AI: Artificial Intelligence
CV: Computer Vision
DL: Deep Learning
CNN: Convolutional Neural Network
ANN: Artificial Neural Network
ML: Machine Learning
IOT: Internet of Things
HCI: Human Computer Interaction
HAR: Human Activity Recognition
HOG: Histograms of Oriented Gradients
SVM: Support Vector Machine
NIR-VIS: Near Infrared-Visible
