
Distracted Driver Detection Using Deep Learning Methods

This document describes a research project that aims to detect distracted driving using deep learning methods. It was written by four students (Siddharth Singh, Aniruddh Dadhich, Pranav Mahajan, and Anurag Singh) at the National Institute of Technology, Hamirpur, under the guidance of Dr. Dharmendra Singh Yadav. The document provides background on the problem of distracted driving, reviews previous literature on detecting distracted driving, describes the dataset and proposed methodology using convolutional neural networks, and discusses initial results from training models on the Inception-v3 and ResNet-50 architectures.

Uploaded by

Siddharth Singh
Copyright
© All Rights Reserved

Distracted Driver Detection using Deep Learning Methods

A Term Paper By:
Siddharth Singh (16447), Aniruddh Dadhich (16421), Pranav Mahajan (16403), Anurag Singh (16404)

Mentored By: Dr. Dharmendra Singh Yadav
Assistant Professor, E&CED Department
National Institute of Technology, Hamirpur

Contents
1 Abstract
2 Introduction
3 Literature Review
4 Dataset Description
5 Proposed Method
5.1 Preprocessing
5.2 Training
6 Architecture
6.1 Convolutional Neural Network
6.2 Inception-v3 Architecture
6.3 Residual Network (ResNet-50)
6.4 Xception Net
7 Activation Function
8 Loss function
8.1 Mean Squared Error
8.2 Likelihood Loss
8.3 Log Loss (Cross Entropy Loss)
9 Regularization
10 What’s Next?
11 Results
12 Conclusion
13 References

1 Abstract
A study by researchers at the Indian Institute of Technology Bombay (IITB) has shown how using
mobile phones while driving can distract drivers and affect their ability to handle dangerous
situations. The results of the study show that both calling and texting while driving degrade a
driver's performance. According to a report by the World Health Organization, road accidents are
the ninth leading cause of death. In India, a country where a road accident occurs every minute and
a life is lost at regular intervals, an earlier report revealed that 31% of the drivers who used a
mobile phone while driving met with accidents. In the last decade alone, India lost 1.3 million
people to road crashes, and another 5.3 million were permanently disabled. Considering the
seriousness of the problem, we present a novel algorithm to detect distracted driving. We employ
Convolutional Neural Networks (CNNs) to identify and classify distracted driving from images taken
by a camera fixed to the dashboard of a four-wheeled vehicle. We trained the model on the ResNet-50
and Inception-v3 architectures; the accuracies obtained are reported in Table 1. Our model not only
predicts distracted driving but also promotes awareness of safe driving by giving drivers alerts.

2 Introduction
According to a World Health Organization (WHO) report, 1.35 million people worldwide die in traffic
accidents each year. That is nearly 3,700 people dying on the world's roads every day. According to
the National Crime Records Bureau and the Ministry of Road Transport and Highways, 1214 road crashes
occur every day in India and 377 people die on average. Road accidents are also the leading cause
of death among people aged between 5 and 29 years. The report also shows that one of the most
common causes of the rising number of traffic accidents is driver distraction. The use of a mobile
phone while driving is widespread among young and novice drivers, adding further to the already
high risk of crash and death in these groups. Drivers' reaction times have also been shown to be
50% slower with telephone use than without.

According to the NHTSA, distracted driving can be defined as "any activity that diverts attention
of the driver from the task of driving" and can be classified into manual, visual and cognitive
distractions. Examples of cognitive distractions are daydreaming and being lost in thought. Manual
distractions include talking or texting on a mobile phone, eating, drinking, and talking to
passengers in the vehicle. An example of a visual distraction is sleepiness.

The situation in India is worse. While India has just 1% of the world's vehicles, it accounts for
over 10% of global road crash fatalities. A SaveLIFE Foundation report highlights that 47% of
people receive calls on their mobile phone while driving and that 60% of them do not stop at a safe
location before answering.

Nowadays, an increasing number of modern vehicles have Advanced Driver Assistance Systems (ADAS)
such as stability control, traction control, lane departure warning, adaptive cruise control and
anti-lock brakes. But even the latest autonomous vehicles are not fully autonomous: they require
the driver to stay undistracted and be ready to take control of the steering wheel in an emergency.
Detection of the driver's attention is therefore a very important system to integrate into ADAS
technology. If a vehicle detects the driver's divided attention and warns the driver, traffic
accidents can be reduced by a significant margin.

The focus of this project is detecting driver distraction using deep learning methods. Several
convolutional neural network approaches are applied. We have used different architectures,
activation functions and loss functions to maximize accuracy. We address the 10 types of driver
distraction described in the Dataset Description section. The project is aimed at real-time
detection of distractions. It is currently limited to the software approach; hardware integration
is the next step.

3 Literature Review

In this section we review the relevant and significant works in the literature on detecting
distracted driving.

In 2011, Zhang proposed a method that uses a Hidden Conditional Random Fields model based on face,
mouth, and hand features to detect the use of a mobile phone [43], creating a dataset with a
camera. Li explored the eye movement patterns of drivers during rear-end collision avoidance under
distraction induced by mobile phones. The study used the driving simulator of the Beijing Jiaotong
University (BJTU), which was projected through a 300-degree front/peripheral field of view at a
resolution of 1400 x 1050 pixels, and head-mounted SensoMotoric Instruments Eye Tracking Glasses
(SMI ETG) to collect eye movement data. Klaner studied the relationship between the performance of
secondary tasks, including the use of mobile phones, and the risk of crashes and near-crashes,
measured through multiple sensors such as accelerometers, cameras and global positioning systems.

Atiquzzaman developed an algorithm to detect driving behavior in two distracting tasks: texting,
and eating or drinking. Using a simulator, subjects performed a series of tasks while driving the
simulation; 10 features related to vehicle acceleration, steering angle, speed, and so on were then
used as input. Three data mining techniques were applied: linear discriminant analysis (LDA),
logistic regression (LR) and support vector machine (SVM).

José María Celaya-Padilla presented a method to detect drivers distracted by their cellphone. A
convolutional neural network is employed, with a camera fitted at the ceiling to capture the
images. The CNN is built on the Inception V3 neural network and trained to detect "texting and
driving" subjects. The final CNN was trained and validated on a dataset of 85,401 images, achieving
an area under the curve (AUC) of 0.891 on the training set, an AUC of 0.86 on a blind test and a
sensitivity of 0.97 on the blind test.

In 2017, Abouelnaga created a new dataset similar to StateFarm's dataset for detecting driver
attentiveness [2]. The authors proposed a solution using a weighted ensemble of five different
convolutional neural networks. The system achieved good classification accuracy, but it is too
complex for real-time detection. Baheti addressed this issue of complexity, reduced the number of
parameters significantly, and achieved an accuracy of 95.54%.

4 Dataset Description
The dataset was obtained from the State Farm dataset available on Kaggle. There are 10 classes in
total, described in the figure. The training set consists of 15000 images, with 1500 images in each
class. The validation set consists of 3000 images, with 300 images in each class.

The dataset includes ten classes:
class 0: Driving safely.
class 1: Texting using right hand.
class 2: Talking on mobile phones with the right hand.
class 3: Texting using left hand.
class 4: Talking on mobile phones with the left hand.
class 5: Adjusting the radio.
class 6: Eating or drinking.
class 7: Turning back.
class 8: Hair or makeup.
class 9: Talking to the passenger.
The train-cross validation split is in the ratio 80:20.

5 Proposed Method
We propose a novel method for identifying distracted driving using convolutional neural networks.

The method is shown in the figure above. The implementation steps are given below:

1. The camera mounted on the dashboard gives us live images of the driver.

2. The images are passed onto the trained CNN model.

3. The trained model predicts the expected class (whether the driver is distracted or not) using
trained weights.
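The three steps above can be sketched as a small inference routine. This is only an illustrative sketch: `classify_frame`, `dummy_predict` and the short label strings are hypothetical names that do not appear in the paper, and `dummy_predict` merely stands in for the trained CNN.

```python
import numpy as np

# Short labels for the ten classes listed in the Dataset Description section.
CLASS_NAMES = [
    "driving safely", "texting (right hand)", "talking on phone (right hand)",
    "texting (left hand)", "talking on phone (left hand)", "adjusting the radio",
    "eating or drinking", "turning back", "hair or makeup", "talking to passenger",
]

def classify_frame(frame, predict_fn):
    """Return (class_index, class_name, is_distracted) for one preprocessed frame."""
    # The model is assumed to take a batch, so wrap the frame in a batch of one.
    probs = np.asarray(predict_fn(frame[np.newaxis, ...]))[0]
    idx = int(np.argmax(probs))
    # Class 0 is "driving safely"; every other class counts as a distraction.
    return idx, CLASS_NAMES[idx], idx != 0

def dummy_predict(batch):
    """Hypothetical stand-in for the trained CNN: always favours class 6."""
    out = np.full((batch.shape[0], 10), 0.01)
    out[:, 6] = 0.91
    return out

frame = np.zeros((224, 224, 1), dtype=np.float32)  # one grayscale dashboard frame
result = classify_frame(frame, dummy_predict)
print(result)
```

In a deployed system `predict_fn` would be the trained model's prediction call, and a `True` distraction flag would trigger the alert to the driver.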

5.1 Preprocessing
1. Resizing: The dataset originally consisted of images of size 480 x 640. The images are resized
to 224 x 224 because training on large images is computationally very expensive.

2. Grayscaling: The images are converted to grayscale, so that each pixel is a single value ranging
from 0 to 255. This decreases computational complexity.

3. Images to Array: The images are then converted into numpy arrays so that they can be passed to
the model. We took 1500 images from each class into the train array and 300 images from each
class into the test array. This is a supervised learning algorithm: we give the network the
inputs and the expected results. The model computes an error from the difference between its
predicted value and the expected value, and the change in weights is proportional to that error.

4. Shuffling the array: We shuffle the array so that the network is not trained on clusters of the
same kind of data, which makes the training process and the weight updates more uniform.

5. Vectorization: The train and test arrays are vectorized so that they can be passed to the
network.

6. Normalization: Normalization is a technique often applied as part of data preparation for
machine learning. Its goal is to change the values of numeric columns in the dataset to a common
scale without distorting differences in the ranges of values. Not every dataset requires
normalization; it is needed only when features have different ranges. For example, consider a
data set containing two features, age (x1) and income (x2), where age ranges from 0 to 100 while
income ranges from 20,000 to 500,000, roughly 1,000 times larger than age. These two features
are in very different ranges. In further analysis, such as multivariate linear regression,
income will intrinsically influence the result more because of its larger values, but this does
not necessarily mean it is more important as a predictor.

7. One hot encoding: The y_test and y_train vectors are then encoded. For example, if y_train
is 4 then it is encoded as 0000100000.
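The preprocessing steps above can be sketched in numpy as follows. This is a minimal sketch under stated assumptions, not the authors' code: the helper names are hypothetical, grayscale conversion is assumed to be a plain channel average, resizing uses nearest-neighbour sampling in place of a library resize, normalization is assumed to be division by 255, and random arrays stand in for the camera images.

```python
import numpy as np

def to_grayscale(rgb):
    """Average the three colour channels into a single (H, W) channel."""
    return rgb.mean(axis=2)

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D array (stand-in for a library resize)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols]

def one_hot(labels, num_classes=10):
    """Encode integer labels as one-hot rows, e.g. 4 -> 0000100000."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def preprocess(images, labels, seed=0):
    """Grayscale, resize, scale to [0, 1], one-hot encode and shuffle."""
    x = np.stack([resize_nearest(to_grayscale(im), 224, 224) for im in images])
    x = x.astype(np.float32) / 255.0   # assumed normalization: pixel / 255
    x = x[..., np.newaxis]             # add the channel axis expected by a CNN
    y = one_hot(np.asarray(labels))
    order = np.random.default_rng(seed).permutation(len(x))  # shuffle together
    return x[order], y[order]

# Random 480 x 640 RGB arrays stand in for the dashboard images.
fake_images = [np.random.default_rng(i).integers(0, 256, size=(480, 640, 3))
               for i in range(4)]
x_train, y_train = preprocess(fake_images, [0, 4, 9, 4])
print(x_train.shape, y_train.shape)
```

Images and labels are shuffled with the same permutation so that each image stays paired with its label.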

5.2 Training:
The training vector and the cross-validation vector are then passed to the network. We have used
three architectures to train the model: ResNet-50, Inception-v3 and Xception Net. The Inception CNN
was chosen because such a CNN can be exported to low-cost hardware like the Raspberry Pi and
deployed on Android platforms. Moreover, the CNN design acts as multiple convolution filters that
are applied to the same input; the results are then concatenated and passed forward. Inception V3
is based on a pattern recognition network and is designed to use minimal amounts of image
pre-processing. Each of the proposed CNN layers reinforces key features; the first layer detects
edges, the second handles the overall design, and so on.

6 Architecture
6.1 Convolutional Neural Network
A Convolutional Neural Network (ConvNet/CNN) is a deep learning algorithm that can take in an input
image, assign importance (learnable weights and biases) to various aspects/objects in the image,
and differentiate one from the other. The pre-processing required by a ConvNet is much lower than
for other classification algorithms. While in primitive methods filters are hand-engineered, with
enough training ConvNets can learn these filters/characteristics. The main idea behind a
Convolutional Neural Network is to learn feature maps of an image and exploit them to build more
nuanced feature maps.
The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the human
brain and was inspired by the organization of the visual cortex. Individual neurons respond to
stimuli only in a restricted region of the visual field known as the receptive field. A collection
of such fields overlaps to cover the entire visual area.

Why ConvNets over Feed-Forward Neural Nets?
An image is just a matrix of pixel values. So why not simply flatten the image (for example, a 4x4
image matrix into a 16x1 vector) and feed it to a multi-layer perceptron for classification? For
very basic binary images, this method may show an average precision score when predicting classes,
but it would have little to no accuracy on complex images with pixel dependencies throughout. A
ConvNet can successfully capture the spatial and temporal dependencies in an image through the
application of relevant filters. The architecture fits the image dataset better because of the
reduction in the number of parameters involved and the reusability of weights. In other words, the
network can be trained to understand the sophistication of the image better.
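To make the parameter argument concrete, the sketch below flattens a 4x4 image and compares the number of weights in a fully connected layer with a convolutional layer on a 224x224 grayscale input. The layer sizes (64 units, 64 filters of size 3x3) are illustrative assumptions, not values from the paper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of a convolutional layer (shared across positions)."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

# Flattening a 4x4 image matrix into a 16-element vector.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
flat = [pixel for row in image for pixel in row]

# Assumed layer sizes for a 224x224 grayscale input.
dense_224 = dense_params(224 * 224, 64)  # every pixel connects to every unit
conv_224 = conv_params(3, 3, 1, 64)      # 64 filters of 3x3, reused everywhere

print(len(flat), dense_224, conv_224)
```

Under these assumptions the dense layer needs about 3.2 million parameters, while the convolutional layer needs 640, which is the weight reuse the text refers to.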

6.2 Inception-v3 Architecture


Inception v3 is a widely used image recognition model that has been shown to attain greater than
78.1% accuracy on the ImageNet dataset. The model is the culmination of many ideas developed by
multiple researchers over the years. The model itself is made up of symmetric and asymmetric
building blocks, including convolutions, average pooling, max pooling, concats, dropouts, and fully
connected layers.
A high-level diagram of the model is shown in the accompanying figure.
The training accuracies attained by the networks are shown in Table 1. In an image classification
task, the size of the salient feature can vary considerably within the image frame, so deciding on
a fixed kernel size is difficult. Larger kernels are preferred for more global features that are
distributed over a large area of the image, whereas smaller kernels give good results in detecting
area-specific features that are distributed across the image frame. For effective recognition of
such variable-sized features, we need kernels of different sizes. That is what Inception does.
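The idea of applying kernels of several sizes to the same input and concatenating the results can be sketched in numpy. This is a deliberately simplified illustration: a real Inception module also contains pooling branches and 1x1 bottleneck convolutions, and the function names and random kernels here are hypothetical.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive 'same'-padded 2-D cross-correlation with an odd-sized square kernel."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)
    out = np.zeros_like(image, dtype=np.float64)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def inception_like_block(image, kernels):
    """Apply every kernel to the same input and stack the results as channels."""
    branches = [conv2d_same(image, k) for k in kernels]
    return np.stack(branches, axis=-1)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
# One kernel each of size 1x1, 3x3 and 5x5 (random values, for illustration only).
kernels = [rng.standard_normal((s, s)) for s in (1, 3, 5)]
features = inception_like_block(image, kernels)
print(features.shape)
```

Because every branch uses "same" padding, all outputs keep the input's spatial size and can be concatenated along the channel axis.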

6.3 Residual Network(ResNet-50)
ResNet, short for Residual Network, is a classic neural network used as a backbone for many
computer vision tasks. This model won the ImageNet challenge in 2015. The fundamental breakthrough
with ResNet was that it allowed us to train extremely deep neural networks with 150+ layers
successfully. Before ResNet, training very deep neural networks was difficult because of the
problem of vanishing gradients.
AlexNet, the winner of ImageNet 2012 and the model that kick-started the focus on deep learning,
had just 8 layers, the VGG network had 19, Inception (GoogLeNet) had 22 layers, and ResNet-152 had
152 layers.
However, increasing network depth does not work by simply stacking layers together. Deep networks
are hard to train because of the vanishing gradient problem: as the gradient is back-propagated to
earlier layers, repeated multiplication may make it extremely small. As a result, as the network
goes deeper, its performance saturates or even starts degrading rapidly. The ResNet-50 model
consists of 5 stages, each with a convolution block and an identity block. Every convolution block
has 3 convolution layers and every identity block also has 3 convolution layers. ResNet-50 has more
than 23 million trainable parameters.
Skip Connection: ResNet introduced the concept of the skip connection, illustrated in the
accompanying diagram. The figure on the left stacks convolution layers one after the other. On the
right we stack convolution layers as before, but we now also add the original input to the output
of the convolution block. This is called a skip connection.
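A skip connection can be sketched in a few lines of numpy. As a simplifying assumption, the sketch uses two small dense layers in place of the convolution layers of a real ResNet block, and the function names are hypothetical; the point is only the `+ x` term.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_block(x, w1, w2):
    """Two stacked layers with no skip connection."""
    return relu(relu(x @ w1) @ w2)

def residual_block(x, w1, w2):
    """The same two layers, but the original input is added back before the final ReLU."""
    return relu(relu(x @ w1) @ w2 + x)

x = np.array([[1.0, 2.0, 3.0]])
zeros = np.zeros((3, 3))  # weights that contribute nothing

plain_out = plain_block(x, zeros, zeros)        # the signal is lost entirely
residual_out = residual_block(x, zeros, zeros)  # the input passes straight through
print(plain_out, residual_out)
```

With all weights at zero the plain block outputs zeros, while the residual block reproduces its (non-negative) input. The block therefore only has to learn the residual on top of the identity, and the added path gives gradients a direct route to earlier layers.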

6.4 Xception Net


The Xception architecture has 36 convolutional layers forming the feature extraction base of the
network. The 36 convolutional layers are structured into 14 modules, all of which have linear
residual connections around them, except for the first and last modules.
The Xception architecture is a linear stack of depthwise separable convolution layers with residual
connections. This makes the architecture very easy to define and modify; it takes only 30 to 40
lines of code using a high-level library such as Keras or TensorFlow-Slim, not unlike an
architecture such as VGG-16 [18], but rather unlike architectures such as Inception V2 or V3, which
are far more complex to define. An open-source implementation of Xception using Keras and
TensorFlow is provided as part of the Keras Applications module, under the MIT license.
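The saving offered by a depthwise separable convolution can be illustrated by counting weights. The layer sizes below (3x3 kernel, 128 input channels, 256 output channels) are assumptions chosen for illustration, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes the channels
    return depthwise + pointwise

standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(standard, separable, round(standard / separable, 1))
```

For these assumed sizes the separable layer uses roughly 8.7 times fewer weights than the standard one.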

7 Activation Function
While building a neural network, it is important to choose which activation function to use in the
hidden layers as well as at the output layer of the network. So, what is an activation function and
why use one? An activation function decides whether a neuron should be activated or not, that is,
whether the information the neuron is receiving is relevant or should be ignored. It is applied
after calculating the weighted sum of the inputs and adding a bias to it.
$$Y = \mathrm{Activation}\Big(\sum (\mathrm{Weights} \times \mathrm{Input}) + \mathrm{Bias}\Big)$$
The purpose of the activation function is to introduce non-linearity into the output of a neuron.
Why do we need non-linear activation functions? A neural network without an activation function is
essentially just a linear regression model. The activation function applies a non-linear
transformation to the input, making the network capable of learning and performing more complex
tasks. In this project we have used the Rectified Linear Unit (ReLU) activation function.
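The formula above, with ReLU as the activation, can be written out for a single neuron. The input, weight and bias values are made up for illustration.

```python
def relu(z):
    """Rectified Linear Unit: pass positive values, clamp negatives to zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """Y = Activation(sum(weights * inputs) + bias)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Weighted sum 0.5*1.0 + 0.25*2.0 + 0.1 = 1.1 is positive, so the neuron fires.
active = neuron([1.0, 2.0], [0.5, 0.25], 0.1)
# Weighted sum -1.0*1.0 - 1.0*2.0 + 0.5 = -2.5 is negative, so ReLU outputs 0.
inactive = neuron([1.0, 2.0], [-1.0, -1.0], 0.5)
print(active, inactive)
```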
8 Loss function

A loss function is a method of evaluating how well your algorithm models your dataset. If your
predictions are totally off, your loss function will output a higher number. If they’re pretty
good, it’ll output a lower number. As you change pieces of your algorithm to try and improve your
model, your loss function will tell you if you’re getting anywhere.

8.1 Mean Squared Error:


Mean Squared Error (MSE) is the workhorse of basic loss functions: it’s easy to understand and
implement and generally works pretty well. To calculate MSE, you take the difference between your
predictions and the ground truth, square it, and average it out across the whole dataset.
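A minimal sketch of the MSE computation described above, using made-up values:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between ground truth and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3.
mse = mean_squared_error([1.0, 0.0, 2.0], [0.5, 0.0, 3.0])
print(mse)
```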

8.2 Likelihood Loss:


The likelihood function is also relatively simple, and is commonly used in classification problems. The
function takes the predicted probability for each input example and multiplies them. And although
the output isn’t exactly human interpretable, it’s useful for comparing models. For example, consider
a model that outputs probabilities of [0.4, 0.6, 0.9, 0.1] for the ground truth labels of [0, 1, 1, 0]. The
likelihood loss would be computed as (0.6) * (0.6) * (0.9) * (0.9) = 0.2916. Since the model outputs
probabilities for TRUE (or 1) only, when the ground truth label is 0 we take (1-p) as the probability.
In other words, we multiply the model’s outputted probabilities together for the actual outcomes.
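The worked example above can be reproduced directly; the function name is chosen here for illustration.

```python
def likelihood(y_true, p_pred):
    """Multiply the probability the model assigned to each actual outcome."""
    result = 1.0
    for y, p in zip(y_true, p_pred):
        # p is the predicted probability of class 1; use (1 - p) when the label is 0.
        result *= p if y == 1 else 1.0 - p
    return result

value = likelihood([0, 1, 1, 0], [0.4, 0.6, 0.9, 0.1])
print(round(value, 4))  # 0.6 * 0.6 * 0.9 * 0.9 = 0.2916
```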

8.3 Log Loss (Cross Entropy Loss):


Log Loss is a loss function also used frequently in classification problems, and is one of the most
popular measures for Kaggle competitions. It’s just a straightforward modification of the
likelihood function with logarithms.
$$\mathrm{Loss} = -\big(y \log(p) + (1 - y) \log(1 - p)\big)$$
This is the same idea as the regular likelihood function, but with logarithms added in. You can see
that when the actual class is 1, the second half of the function disappears, and when the actual
class is 0, the first half drops. That way, each example contributes only the negative log of the
predicted probability for its ground truth class, and these terms are added up rather than
multiplied.
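A minimal sketch of the log loss formula, applied to the same example used for the likelihood loss. Averaging over the examples and clipping the probabilities with a small `eps` are assumptions added here to keep the value well defined; neither is stated in the text.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross entropy: -(y*log(p) + (1 - y)*log(1 - p)) per example."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

loss = log_loss([0, 1, 1, 0], [0.4, 0.6, 0.9, 0.1])
print(loss)
```

Since the logarithm turns a product into a sum, this value equals the negative log of the likelihood 0.2916 divided by the four examples.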

9 Regularization
Any modification we make to a learning algorithm that is intended to reduce its generalization
error but not its training error is called regularization. Keeping the model simple enough by using
regularization techniques allows the network to generalize well on data points it hasn’t seen
before.

1. Weight Penalty L1 and L2: A weight penalty is the standard way to regularize and is widely used
in training other model types. It relies strongly on the implicit assumption that a model with
small weights is somehow simpler than a model with large weights. The penalties try to keep the
weights small or non-existent (zero) unless there are big gradients to counteract them, which also
makes models more interpretable. The L2 norm penalizes the square of the weight (which explains the
“2” in the name) and tends to drive all the weights to smaller values. The L1 norm penalizes the
absolute value of the weight (a V-shaped function) and tends to drive some weights to exactly zero
(introducing sparsity in the model), while allowing some weights to be big.
2. Dropout regularization: Dropout goes over all the layers in a neural network and assigns each
node a probability of being kept. Whether a given node is kept is then decided at random; you only
choose the threshold, a value that determines how likely a node is to be kept. For example, if you
set the threshold to 0.7, then there is a probability of 30% that a node will be removed from the
network. This results in a much smaller and simpler neural network.
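Both techniques can be sketched in numpy. The function names and the penalty strength `lam` are illustrative assumptions. The dropout sketch also rescales the surviving activations by `1 / keep_prob` ("inverted dropout"), a common convention that the text does not mention.

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * np.sum(weights ** 2)

def dropout(activations, keep_prob, rng):
    """Keep each node with probability keep_prob and rescale the survivors."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

weights = np.array([0.5, -1.5, 0.0, 2.0])
l1 = l1_penalty(weights, 0.01)  # 0.01 * (0.5 + 1.5 + 0.0 + 2.0)
l2 = l2_penalty(weights, 0.01)  # 0.01 * (0.25 + 2.25 + 0.0 + 4.0)

# With a threshold (keep probability) of 0.7, about 30% of the nodes are zeroed.
dropped = dropout(np.ones(1000), 0.7, np.random.default_rng(0))
print(l1, l2, np.mean(dropped > 0))
```

The penalty is added to the training loss, while dropout is applied only during training and switched off at prediction time.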
10 What’s Next?

This project focuses only on the software part of the distracted driver detection system. The
system has the potential to be implemented in real cars, where it could significantly reduce the
traffic accidents caused by driving distractions. So, the next step is to integrate the software
with an electronic hardware model that is compatible with the ADAS system of automated cars. The
integrated system will be connected to the car’s main microprocessor. Whenever the uploaded trained
model detects that the driver is distracted, it will send a signal to the microprocessor, which
will in turn send a warning signal to the driver. Currently, similarities in the postures of
different behaviours result in incorrect classifications. The behaviours with the most
misclassifications are ‘drinking’, ‘hair and makeup’ and ‘texting on the phone – left’. In the
future, we need to solve this misclassification issue.

11 Results

                             Inceptionv3    Resnet50    XceptionNet
Number of Training Images    15,000         15,000      15,000
Training Accuracy (%)        98.27          99.37       98.28
Validation Accuracy (%)      96.70          98.77       90.85
No. of epochs                80             80          36

Table 1: Accuracies of the different architectures used on the Distracted Driving dataset.

Modifications to a model should also take the dataset into account. In histopathology, the
available data sets are usually limited and meagre. Data augmentation therefore plays a big role in
this kind of situation, since new data is generated from the old data itself. Another important
problem in histopathology is color consistency: variations in the level of illumination must be
compensated for. These problems are reduced by the color normalization method.

12 Conclusion
In this paper, a system for detecting distracted drivers based on convolutional neural networks was
designed. The weights were initialized randomly. Training was carried out with two kinds of
network: the first was Inception Net and the second was the Residual Network (ResNet50). The
training accuracies attained by these networks are shown in Table 1.

In an image classification task, the size of the salient feature can vary considerably within the
image frame, so deciding on a fixed kernel size is difficult. Larger kernels are preferred for more
global features that are distributed over a large area of the image, whereas smaller kernels give
good results in detecting area-specific features that are distributed across the image frame. For
effective recognition of such variable-sized features, we need kernels of different sizes. That is
what Inception does.

Neural networks are not good at finding a simpler mapping when one exists. When the output should
simply equal the input, the weights should be 1 and the biases should be 0, but backpropagation in
neural networks makes this mapping complex instead of simply setting the weights to 1. This happens
because of the vanishing gradient problem: as we go deeper in a CNN, the derivative back-propagated
to the initial layers becomes almost insignificant in value. ResNet addresses this issue by
introducing “shortcut connections”.

13 References

1. Ronneberger O., Fischer P., Brox T. (2015) U-Net: Convolutional Networks for Biomedical
Image Segmentation. In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image
Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer
Science, vol 9351. Springer, Cham

2. Zhengyang Wang, Shuiwang Ji (2018) Smoothed Dilated Convolutions for Improved Dense
Prediction. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining (pp. 2486-2495)

3. Fisher Yu, Vladlen Koltun (2015) Multi-Scale Context Aggregation by Dilated Convolutions.
Published as a conference paper at ICLR 2016

4. Arnout C. Ruifrok, Ph.D., and Dennis A. Johnston, Ph.D. (2001)

5. Depthwise separable convolutions for machine learning -
https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/
